computer programs that use real arithmetic (hereinafter referred to as "mathematical
programs") is that the mathematical properties of real arithmetic operations in compu¬
Chapter 1
Introduction 


This  is  the  final  report  for  the  Reals  project’s  Task  5.  It  follows  Chapter 
3  of  the  Task’s  interim  report  [ORA88]  and  describes  different  methods  for 
representing  real  numbers  and  performing  operations  on  them  in  computers. 
It  summarizes  and  evaluates  significant  ideas  from  the  technical  literature  on 
computer  arithmetic  and  interval  analysis,  relates  results  in  interval  analysis 
to  the  Reals  project’s  work  on  asymptotic  correctness,  presents  empirical 
results on naive and sophisticated interval algorithms and on VAX and
IEEE-standard floating-point arithmetic, and lists questions for future research.

The representation ideas from the literature include one we consider superior
to standard floating-point for command and control applications. The
questions  for  future  research  include  new  proposals  for  representing  the  real 
numbers  and  an  open  theoretical  question  on  representing  the  integers. 

In our interim report, we classified error in computer calculations as either
input error, modeling error, truncation error or computational error. We
noted that different methods for representing real numbers and the elementary
operations on them can only affect the computational error, which is
often insignificant compared to other parts of the total error. We also noted
that, in order to maintain reasonable computation speed and have stored
values occupy reasonably small areas in computer memory, a representation
system must discard information and hence introduce computational error.

We  described  interval  analysis,  a  technique  that  maintains  intervals  whose 
endpoints are conservative upper and lower bounds on all input and computed
quantities, as a method for effectively eliminating error and replacing
it  with  uncertainty,  uncertainty  reflected  in  the  lengths  of  the  intervals  input 
or  computed.  Chapter  2  describes  our  subsequent  work  on  interval  analysis. 

Chapter  2  gives  empirical  results  on  interval  algorithms  which  show  that 
naive  interval  analysis  is  unlikely  to  be  useful  but  that  more  sophisticated 
interval  algorithms  can  be  surprisingly  effective.  The  empirical  results  use  a 
version of interval arithmetic that takes advantage of characteristics of
IEEE-standard floating-point arithmetic [IEE85]. Chapter 2 briefly summarizes
the  Reals  project’s  notion  of  asymptotic  correctness  [ORA87],  describes  a 
possible  extension  of  this  notion  to  interval  arithmetic,  and  gives  the  results 
of  our  efforts  to  relate  interval  asymptotic  correctness  to  scalar  asymptotic 
correctness.  Chapter  2  also  describes  a  technique  from  Matijasevich  [Mat85] 
for dealing with a basic defect of interval arithmetic that causes it to be
overly  conservative. 

Our  interim  report  noted  that  floating-point  arithmetic  has  advantages 
that  should  not  be  neglected  when  considering  alternative  representations, 
and  expressed  the  hope  that  using  highly  accurate  floating-point  arithmetic 
would  be  a  practical  way  to  make  the  asymptotic  model  accurate.  Chapter 
3  describes  our  subsequent  work  on  floating-point  arithmetic. 

Chapter  3  gives  examples  showing  the  strengths  and  weaknesses  of  the 
asymptotic  model,  and  discusses  the  costs  and  benefits  of  making  it  accurate. 
Chapter 3 summarizes results from the technical literature on the time, space
and complexity costs of highly accurate floating-point arithmetic, particularly
floating-point arithmetic compatible with the IEEE standard. Chapter
3  also  describes  results  from  the  literature  on  on-line  versions  of  floating-point 
arithmetic  that  facilitate  doing  parallel  computation. 

Our interim report stated that we would no longer pursue the Combinatorial
Representation, which used algebraic topology and was originally investigated
by Task 5, but would instead study and evaluate other alternative representation
systems for computer arithmetic given in the engineering literature.
Chapters 4 and 5 describe results from the literature.

Chapter 4 describes mathematical ideas used by the alternative representation
systems considered in Chapter 5. These ideas include approximate
rational arithmetic, mediant rounding, standard and generalized continued
fractions, and Gosper’s algorithm for doing arithmetic on continued fractions.

Chapter 5 summarizes and comments on the following proposed representation
systems: the fixed-slash and floating-slash representations by Matula
and Kornerup [MK80,KM81,KM83a,MK85]; the binary-coded lexicographic
continued fraction representation by Matula and Kornerup [MK83,KM85,
KM87,KM88]; the hybrid fixed-slash and floating-point representation by
Hwang and Chang [HC78]; the variable-length-exponent representation by
Iri and Matsui [MI81]; the repeating-mantissa floating-point representation
by Yoshida [Yos83]; the hyper-exponential representation by Olver and
Clenshaw [Olv87]; and the finite p-adic representation by Gregory and
Krishnamurthy [GK84].

The  approximate  rational  arithmetic  systems  from  Matula  and  Kornerup 
show that our plan, given in our interim report, to only consider alternative
representation  systems  as  means  for  specifying  the  endpoints  of  intervals, 
was  overly  restrictive.  Chapter  5’s  comments  note  for  each  representation 
whether  it  is  suitable  for  representing  the  endpoints  of  intervals  and  whether 
it  facilitates  parallel  computation.  Chapter  5  also  includes  remarks  about 
the  empirical  evidence  from  Matula  and  Ferguson  [FM85]  supporting  the 
floating-slash  representation  system. 

Our  interim  report  described  constructive-real  representation  systems 
that  make  it  possible  to  do  calculations  to  a  user-specified  or  data-determined 
degree  of  accuracy.  Our  interim  report  noted  that  these  systems  could  not 
be  used  for  typical  real-time  computation  applications  because  they  do  not 
discard  enough  information,  but  also  noted  that  they  might  be  useful  for 
calculations requiring unpredictable amounts of accuracy in intermediate
results. The constructive-real systems described in our interim report included
a “lazily-evaluated continued fraction” system implemented by Jones [Jon84]
and a “real as the limit of rationals” system by Boehm [Boe87]. Chapter 6
describes  our  work  with  constructive-real  representation  systems. 

Chapter 6 lists general properties of constructive-real representation systems,
including what we learned about their suitability for real-time applications.
It gives Caliban programs for computing standard and generalized
continued fraction expansions of rationals, and results on using Gosper’s
algorithm to perform arithmetic on standard and generalized continued fractions.
It also gives additional information on Boehm’s system of constructive-real
arithmetic.

Chapter 7 summarizes the conclusions of this report. It gives our own
suggestions for alternative representation systems and raises questions for
future research, particularly a question about the theoretical limits of
representations of the integers.

Finally,  Chapter  8  ties  up  loose  ends  from  Task  5.  It  notes  plans  or 
further  studies  that  we  were  unable  to  carry  out,  and  corrects  two  errors  in 
our  interim  report. 

Chapter 2

Interval  Analysis 


This chapter summarizes our work on interval analysis. It first defines
interval analysis and terminology used later in the chapter, then reviews
elementary relationships between interval and scalar arithmetic. It next gives
quantitative  results  comparing  naive  interval  computations  to  scalar  ones, 
and  contrasting  the  accuracy  of  naive  and  sophisticated  interval  algorithms. 
The  chapter  then  describes  the  results  of  our  efforts  to  define  a  notion  of 
asymptotic  correctness  for  interval  algorithms  and  to  relate  proofs  of  interval 
and  scalar  asymptotic  correctness.  Finally,  the  chapter  describes  a  proposal 
by  Matijasevich  [Mat85]  for  avoiding  one  of  the  major  problems  with  interval 
analysis. 

2.1  Basic  Definitions 

Interval  analysis  maintains  exact  bounds  on  the  absolute  error  in  all  data 
values,  and  treats  quantities  as  being  known  only  to  belong  to  intervals. 
The  operations  of  interval  arithmetic  act  on  intervals  and  produce  intervals 
that  contain  the  results  of  the  corresponding  operations  on  all  real  numbers 
contained  in  the  original  intervals. 

We will refer to real numbers as scalar values, to operations on real numbers
as scalar operations, and to programs that use a fixed number representation
system’s values and operations (e.g., IEEE-standard double-precision
floating-point arithmetic) instead of intervals and interval operations as
scalar programs. Let $M$ be the set of values representable in the fixed
representation system. If the endpoints (possibly $+\infty$ or $-\infty$) of an
interval are in $M$, call the interval machine-representable.

Let a rounding be a function $\Box : \mathbb{R} \cup \{+\infty, -\infty\} \to M$ satisfying:

$$\forall x \in M \; (\Box x = x)$$

$$\forall x, y \in \mathbb{R} \; (x \le y \Rightarrow \Box x \le \Box y)$$

Call a rounding $\Box$ upwardly directed if $\Box x \ge x$ and downwardly directed
if $\Box x \le x$. Let $\triangle$ and $\nabla$ be upwardly and downwardly directed
roundings, respectively. Define a conservative rounding $\Diamond$ on intervals by

$$\Diamond A = \Diamond [a_1, a_2] = [\nabla a_1, \triangle a_2].$$

For $\star \in \{+, -, \cdot, /\}$ and intervals $A$ and $B$, if $a \star b$ is defined
for every $a \in A$ and $b \in B$, define the interval operation $A \star B$ by letting

$$A \star B = \{a \star b \mid a \in A,\ b \in B\},$$

and let $A \star B$ be undefined otherwise. Define the machine interval arithmetic
operations $\circledast$ for $\star \in \{+, -, \cdot, /\}$ on machine-representable
intervals $A$ and $B$ by letting

$$A \circledast B = \Diamond (A \star B)$$

whenever the resulting interval is defined, and letting it be undefined
otherwise. We will refer to programs using machine-representable intervals and
machine interval arithmetic operations, which we will usually refer to as
intervals and interval arithmetic, as interval programs.
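
The definitions above can be illustrated with a minimal sketch. It is not the package of Appendix B: as an assumption made only for illustration, the set of machine-representable values is taken to be the decimal numbers with 4 significant digits, and the downwardly and upwardly directed roundings are supplied by Python's `decimal` contexts with `ROUND_FLOOR` and `ROUND_CEILING`. The names `DOWN`, `UP`, `iv_add`, `iv_sub`, `iv_mul` and `iv_div` are hypothetical.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

# Assumed stand-in for the representable set: 4 significant decimal digits.
DOWN = Context(prec=4, rounding=ROUND_FLOOR)    # downwardly directed rounding
UP = Context(prec=4, rounding=ROUND_CEILING)    # upwardly directed rounding

# An interval is a pair (lo, hi) of Decimal endpoints with lo <= hi.

def iv_add(a, b):
    """Machine interval sum: round the exact endpoints outward."""
    return (DOWN.add(a[0], b[0]), UP.add(a[1], b[1]))

def iv_sub(a, b):
    """Machine interval difference."""
    return (DOWN.subtract(a[0], b[1]), UP.subtract(a[1], b[0]))

def iv_mul(a, b):
    """Machine interval product: the extremes occur at endpoint products."""
    pairs = [(x, y) for x in a for y in b]
    return (min(DOWN.multiply(x, y) for x, y in pairs),
            max(UP.multiply(x, y) for x, y in pairs))

def iv_div(a, b):
    """Machine interval quotient, undefined if the divisor contains 0."""
    if b[0] <= 0 <= b[1]:
        raise ZeroDivisionError("interval quotient undefined: divisor contains 0")
    pairs = [(x, y) for x in a for y in b]
    return (min(DOWN.divide(x, y) for x, y in pairs),
            max(UP.divide(x, y) for x, y in pairs))
```

For example, `iv_div((Decimal(1), Decimal(1)), (Decimal(3), Decimal(3)))` yields the two neighbouring 4-digit values that bracket 1/3. Each context rounds the exact result of a single operation, so every computed interval contains the corresponding exact interval.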

Appendix  A  describes  a  form  of  interval  arithmetic  for  intervals  whose 
endpoints are double-precision, IEEE-standard floating-point values.
Appendix B gives an implementation of this interval arithmetic.

Aberth [Abe88] describes a version of interval arithmetic called range
arithmetic  that  represents  an  interval  as  a  midpoint  and  a  distance,  called  a 
range,  from  this  midpoint  to  the  interval’s  endpoints.  The  operations  of  range 
arithmetic  produce  interval  bounds  that  are  not  quite  as  tight  as  the  bounds 
produced  by  conservative  roundings  to  machine-representable  endpoints,  but 
the  operations  of  range  arithmetic  are  simpler  and  faster. 
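
A generic midpoint-and-range representation can be sketched as follows. This is not Aberth's implementation, whose details are not given here; it assumes exact integer or rational arithmetic, ignores rounding of the midpoint and range entirely, and uses hypothetical names throughout. It shows one way a midpoint-range product can be looser than the product computed from endpoints.

```python
# A value is a pair (mid, rng) standing for the interval [mid - rng, mid + rng].

def to_endpoints(r):
    """Convert (mid, rng) to an endpoint pair (lo, hi)."""
    mid, rng = r
    return (mid - rng, mid + rng)

def range_add(a, b):
    """Sum: midpoints add, ranges add."""
    return (a[0] + b[0], a[1] + b[1])

def range_mul(a, b):
    """Product: one multiplication for the midpoint, a simple bound for the range."""
    (ma, ra), (mb, rb) = a, b
    return (ma * mb, abs(ma) * rb + abs(mb) * ra + ra * rb)

def endpoint_mul(a, b):
    """Exact interval product computed from endpoint pairs, for comparison."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))
```

With `a = (2, 1)` and `b = (3, 1)`, standing for [1, 3] and [2, 4], `range_mul` gives the interval [0, 12], while the endpoint product is [2, 12]. The range version needs fewer multiplications and comparisons but gives a looser lower bound.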

2.2 Elementary Properties


The  great  advantage  of  interval  analysis  is  that  it  recognizes  and  bounds 
input, truncation and computational error, bounding every form of error
except modeling error. Interval results definitely contain the desired quantities,
and the lengths of the intervals clearly show the total uncertainty in these
quantities. The results of scalar programs, by contrast, are typically close
to the desired quantities, but not known to be higher, lower or exact, and
nothing shows how uncertain they really are. Interval analysis pays for this
great advantage with several disadvantages:

Interval  arithmetic  requires  roughly  twice  as  much  calculation  as  scalar 
arithmetic,  though  the  cost  of  this  can  be  greatly  reduced  by  performing 
many  of  the  necessary  calculations  in  parallel.  The  cost  can  also  be  reduced 
by  using  range  arithmetic. 

Interval arithmetic requires computing, in hardware or software, appropriate
upward and downward rounding functions. IEEE-standard floating-point
hardware  includes  these  rounding  functions,  but  they  complicate  it  and  all 
other  representation  systems  that  provide  them.  Range  arithmetic,  which 
produces  slightly  looser  bounds,  does  not  require  these  functions;  Aberth’s 
[Abe88]  implementation  of  interval  arithmetic  runs  on  IBM  machines  where 
IEEE-standard  floating-point  is  not  available. 

There  are  also  significant  problems  with  interval  arithmetic  that  cannot 
be  eliminated  with  more  accurate  representations  of  intervals’  endpoints  or 
more  accurate  operations  on  these  endpoints.  Interval  arithmetic  does  not 
acknowledge,  for  instance,  that  input  and  computation  errors  are  usually  less 
than  their  extreme  values  and  often  cancel  out.  This  is  the  main  reason  that 
input  and  calculation  errors  do  not  cause  problems  more  often  than  they  do. 
If $n$ uniformly-distributed random error variables are added together, which
corresponds to adding $n$ intervals, the worst-case error is proportional to $n$,
but the standard deviation of the accumulated error is proportional only to $\sqrt{n}$.
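
As a worked check on these growth rates, assume (an assumption not stated in the text) that the $n$ errors $e_i$ are independent and uniformly distributed on $[-\epsilon, \epsilon]$:

```latex
\begin{aligned}
\Bigl|\sum_{i=1}^{n} e_i\Bigr| &\le n\,\epsilon
  && \text{(worst case: half the length of the summed interval)} \\
\operatorname{Var}\Bigl(\sum_{i=1}^{n} e_i\Bigr) &= n \cdot \frac{\epsilon^{2}}{3}
  && \text{(independence, with } \operatorname{Var}(e_i) = \epsilon^{2}/3\text{)} \\
\sigma\Bigl(\sum_{i=1}^{n} e_i\Bigr) &= \epsilon\,\sqrt{n/3}
  && \text{(typical size of the accumulated error)}
\end{aligned}
```

For $n = 10{,}000$, the interval bound allows an error of $10{,}000\,\epsilon$, while the standard deviation of the accumulated error is only about $58\,\epsilon$.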

This  defect  is  why  our  plan  to  consider  alternative  representation  systems 
only as means for representing the endpoints of intervals was overly restrictive.
Some of the representation systems described in Chapter 5, particularly
the  ones  that  use  mediant  rounding,  are  intended  to  exploit  simplicity  in 
computed  results  so  that  accumulated  errors  frequently  cancel  out. 

Similarly, interval arithmetic does not reflect the relationships between
errors in different computed values, so its computed bounds on these errors
are often far too pessimistic. If a real number $x$ is known only to lie in the
interval $[-1, 1]$, for example, interval multiplication would say of $x \cdot x$
only that it lies in $[-1, 1]$, when actually it must lie in $[0, 1]$. This problem
cannot be solved by looking at individual operations and their arguments: if two
real numbers $x$ and $y$ are both known only to lie in $[-1, 1]$, for example, but
nothing is known about the relationship between $x$ and $y$, then the product
$x \cdot y$ is correctly known only to lie in $[-1, 1]$. This is the problem addressed
by Matijasevich’s ideas described in Section 2.5.
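
The example can be reproduced with a short sketch that uses exact integer endpoints, so no rounding is involved. The function names are hypothetical. `iv_mul` treats its two arguments as unrelated, while `iv_square` exploits the knowledge that both factors are the same quantity.

```python
def iv_mul(a, b):
    """Interval product of two intervals assumed to vary independently."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))

def iv_square(a):
    """Range of x*x for a single x in the interval a."""
    lo, hi = a
    if lo <= 0 <= hi:
        # The square can reach 0 but can never be negative.
        return (0, max(lo * lo, hi * hi))
    return tuple(sorted((lo * lo, hi * hi)))

x = (-1, 1)
naive = iv_mul(x, x)      # treats the two occurrences of x as unrelated
exact = iv_square(x)      # uses the fact that both factors are the same x
```

Here `naive` is the pessimistic interval from the text and `exact` is the true range of the square. An operation that sees only two interval arguments cannot tell these cases apart, which is why the problem cannot be fixed one operation at a time.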

Operations other than the basic operations of interval arithmetic can
sometimes  be  used  to  make  deductions  about  relationships  between  the  errors 
in  different  quantities.  If  a  quantity  is  known  to  fall  in  two  intervals,  for 
example,  it  is  known  to  fall  in  their  intersection.  Such  operations  are  included 
in  the  interval  arithmetic  given  in  Appendix  B  and  used  in  the  Interval 
Newton’s  Method  program  discussed  in  Subsection  2.3.2  below. 


2.3 Quantitative Results

This section describes quantitative results for a naive and a more sophisticated
interval algorithm. The naive interval algorithm, which is a simple
translation of a corresponding scalar algorithm, uses a Fast Fourier Transform
(FFT) to multiply large integers. The more sophisticated interval algorithm
uses an interval version of Newton’s Method to find roots of polynomials.
Results for the naive algorithm include results for both IEEE and VAX
arithmetic;  results  for  the  more  sophisticated  algorithm  use  IEEE  arithmetic 
only. 

The IEEE-based interval computations were carried out with the interval
arithmetic package given in Appendix B. All IEEE computations were carried
out in double-precision floating-point on a Sun 3/60 with an MC68881
floating-point coprocessor, mask A95N, under release 3.5 of the Sun UNIX
4.2 operating system.

The  VAX-based  interval  computations  were  carried  out  with  interval 
arithmetic  subroutines  in  the  code  given  in  Appendix  C,  Section  2.  VAX 
arithmetic [Dig81] does not have directed roundings, but for each operation
produces a result that is either the representable value closest to the exact
answer or the representable result with larger absolute value of the two
equally-close representable values. VAX arithmetic is thus similar to IEEE
arithmetic [IEE85] in its default round-to-nearest rounding mode, except
that if there are two equally-near values VAX arithmetic takes the one with
larger absolute value, while IEEE arithmetic takes the one whose last bit is 0.
The VAX interval arithmetic subroutines compute absolute upper and lower
bounds  for  intervals  by  adding  or  subtracting  the  quantity  represented  by  the 
least  significant  bit  in  approximate  upper  and  lower  bounds  computed  with 
VAX  arithmetic.  All  VAX  computations  were  carried  out  in  double-precision 
floating-point  (D_floating)  arithmetic  on  a  VAX  11/750. 
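
The widening technique described for the VAX subroutines can be sketched in Python. This is an illustration under stated assumptions, not the code of Appendix C: Python floats are IEEE doubles in round-to-nearest mode rather than VAX D_floating values, `math.nextafter` (Python 3.9 or later) stands in for adding or subtracting the least significant bit, and the function names are hypothetical.

```python
import math

def widen(lo, hi):
    """Step each approximate bound one representable value outward."""
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

def add_widened(a, b):
    """Interval sum from round-to-nearest endpoint sums, then widened."""
    return widen(a[0] + b[0], a[1] + b[1])

def mul_widened(a, b):
    """Interval product from round-to-nearest endpoint products, then widened."""
    products = [x * y for x in a for y in b]
    return widen(min(products), max(products))
```

A round-to-nearest result is within half a unit in the last place of the exact result, so stepping one representable value outward gives a safe bound, barring overflow. The bound is not optimal, though: a correctly rounded endpoint is pushed outward even when it was already on the safe side, which is the sense in which the text calls the VAX bounds not optimal.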

Every  result  is  given  in  both  ordinary  scientific  notation  and  in  a  “literal” 
form that is a string of hexadecimal digits specifying the exact bit pattern
used  to  represent  the  result  in  whichever  arithmetic  system  —  either  IEEE 
or  VAX  —  is  currently  being  used.  For  the  interpretations  of  these  strings, 
see  [IEE85]  for  the  IEEE  values  and  [Dig81]  for  the  VAX  ones. 

2.3.1 FFT Multiplication Results

We  implemented  scalar  and  interval  versions  of  programs,  both  for  IEEE 
and  for  VAX  arithmetic,  that  use  Fast  Fourier  Transforms  (FFTs)  to  multiply 
large integers. The algorithm we implemented is described in Knuth [Knu81],
Section 4.3.3, Part C. The interval versions of these programs were obtained
by  replacing  floating-point  values  and  operations  by  corresponding  interval 
ones. 

High  accuracy  is  actually  not  necessary  for  this  application,  since  the  ideal 
final  results  are  known  to  be  integers  and  can  thus  be  determined  exactly 
from  floating-point  approximations  that  are  in  error  by  less  than  1/2.  Since 
the  ideal  final  results  are  known  to  be  integers,  though,  it  is  possible  to 
determine  the  error  in  the  floating-point  approximations  easily,  even  though 
these  approximations  are  only  produced  after  a  reasonably  large  amount  of 
computation.  Our  programs  were  written  to  be  run  on  integers  that  were 
small  enough  to  be  multiplied  with  machine  hardware,  and  to  produce  output 
that  was  convenient  for  showing  the  errors  in  the  floating-point  results. 
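
The idea of recovering exact integer results from floating-point approximations can be sketched as follows. This is a generic textbook-style illustration and not the program of Appendix C: it assumes non-negative integers split into decimal digits and a simple recursive radix-2 FFT, and the names `fft` and `fft_multiply` are hypothetical. It returns both the product and the largest distance of any floating-point coefficient from the nearest integer, which is the error measure the paragraph above describes.

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_multiply(x, y):
    """Multiply non-negative integers x and y via FFT convolution of digits."""
    xs = [int(d) for d in str(x)][::-1]   # least significant digit first
    ys = [int(d) for d in str(y)][::-1]
    n = 1
    while n < len(xs) + len(ys):
        n *= 2
    fa = fft(xs + [0] * (n - len(xs)))
    fb = fft(ys + [0] * (n - len(ys)))
    conv = fft([p * q for p, q in zip(fa, fb)], invert=True)
    raw = [v.real / n for v in conv]      # ideally exact integers
    coeffs = [round(r) for r in raw]      # correct whenever the error is < 1/2
    max_err = max(abs(r - c) for r, c in zip(raw, coeffs))
    product = sum(c * 10 ** i for i, c in enumerate(coeffs))
    return product, max_err
```

Because the inputs are small enough to multiply directly, the result can be checked against `x * y`, and `max_err` shows how far the floating-point coefficients drifted from the integers they represent.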

The interval version for IEEE arithmetic is given in Appendix C, Section
1, and uses the interval-arithmetic package given in Appendix B. The interval
version for VAX arithmetic is given in Appendix C, Section 2. The two scalar
versions  of  these  programs  were  almost  identical.  The  IEEE  version  is  given 
in  Appendix  C,  Section  3,  and  the  VAX  version  was  created  by  making  the 
changes  listed  in  comments  in  this  code. 

An  edited  version  of  the  combined  output  from  the  four  programs  when 
they were used to multiply three particular pairs of integers is given in
Appendix D. The editing consisted of removing redundant descriptive information
and rearranging lines from the outputs to make it easier to compare
them. 

These  FFT  multiplication  results  support  our  interim  report’s  comment, 
itself consistent with observations in the literature [SB80], that interval
algorithms obtained by simply replacing operations on real numbers with the
corresponding  operations  on  intervals  usually  produce  error  bounds  that  are 
much  too  pessimistic.  Even  for  the  IEEE  results,  which  are  correctly  rounded, 
the  length  of  the  interval  calculated  is  sometimes  over  2360  times  the  error 
in the corresponding scalar quantity. In three IEEE results, the scalar quantity
is actually correct to all 52 of an IEEE double-precision value’s bits of
precision,  while  the  corresponding  interval  reflects  uncertainty  in  as  many  as 
17  of  the  final  bits. 

The VAX interval results do not really reflect undue conservatism in interval
bounds because the computed bounds are not optimal for the underlying
floating-point arithmetic. These results do show, however, that the greater
accuracy in VAX double-precision over IEEE double-precision arithmetic can
produce more accurate results even without optimal rounding. This is
discussed in Chapter 3.


2.3.2  Interval  Newton’s  Method  Results 

We  also  implemented  a  more  sophisticated  interval  algorithm,  the  Interval 
Newton’s Method algorithm from Alefeld and Herzberger [Ale86]. The code
for this algorithm is given in Appendix E, and uses the IEEE interval
arithmetic package given in Appendix B.
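
This algorithm is based on the following observations. Define the length
and midpoint functions on an interval $A = [a_0, a_1]$ by

$$\mathrm{length}(A) = a_1 - a_0 \quad\text{and}\quad \mathrm{mid}(A) = (a_0 + a_1)/2.$$

Suppose $f$ is a continuous function, and suppose the intervals $X^{(0)} = [x_0, x_1]$
and $M = [m_0, m_1]$ are such that $f(x_0) < 0$, $f(x_1) > 0$, and $f$ has a root $\xi$ in
$X^{(0)}$ such that for all $x \in X^{(0)}$ with $x \ne \xi$,

$$0 < m_0 \le \frac{f(x)}{x - \xi} \le m_1 < \infty.$$

Define intervals $X^{(k)}$ for all $k \ge 0$ inductively by

$$X^{(k+1)} = \left( \mathrm{mid}(X^{(k)}) - \frac{f(\mathrm{mid}(X^{(k)}))}{M} \right) \cap X^{(k)}.$$

In operations combining a real number and an interval, interpret the real
number as a point interval. Then for all $k \ge 0$,

$$\xi \in X^{(k)},$$

$$X^{(0)} \supseteq X^{(1)} \supseteq \cdots \supseteq X^{(k)}, \quad\text{and}$$

$$\mathrm{length}(X^{(k+1)}) \le \frac{1}{2}\left(1 - \frac{m_0}{m_1}\right)\mathrm{length}(X^{(k)}), \quad\text{so}$$

$$\lim_{k \to \infty} X^{(k)} = \xi.$$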

The code in Appendix E acknowledges that the midpoint of an interval
can usually only be computed approximately, but in evaluating $f$ at an
imprecisely-known midpoint it assumes that the possible range of values is
contained in the interval determined by evaluating $f$, with rounding modes
set appropriately, at the endpoints of the interval containing the midpoint.
This assumption is reasonable, since the interval containing the midpoint will
either be a point interval or have two consecutive representable values as its
endpoints.

Samples  of  the  output  from  this  code  are  given  in  Appendix  F.  All  of  these 
results  for  the  Interval  Newton’s  Method  are  not  only  highly  accurate,  but 
perfect. As the results of Boehm’s [Boe87] arbitrary-precision calculator and
the “literal” results show, the intervals computed are the shortest possible
machine-representable intervals containing the desired quantities.

2.3.3 Summary


These  results  show  that  a  simple-minded  application  of  interval  analysis  is 
unlikely to be useful, but that nontrivial interval algorithms can be
surprisingly powerful and are worthy of further investigation. We did not expect
the  simple-minded  interval  algorithms,  particularly  those  using  the  control 
of  rounding  available  in  IEEE  arithmetic,  to  perform  so  poorly,  and  we  did 
not  expect  the  Interval  Newton’s  Method  algorithm  to  perform  so  well.  This 
conclusion  about  the  usefulness  of  simple-minded  interval  analysis  is  also 
supported  by  our  experience,  described  in  the  next  section,  with  trying  to 
relate scalar and interval versions of asymptotic correctness.


2.4  Interval  Asymptotic  Correctness 

We initially conjectured, since the results of interval calculations definitely
contain desired exact values while the results of scalar ones are merely usually
close to these exact values, that if a program could be proved asymptotically
correct when the values of its variables and operations were reinterpreted
as intervals and interval operations then the program would not only be
asymptotically correct but effectively asymptotically correct; i.e., it would
be theoretically possible to compute the degree of machine accuracy necessary
to have the program produce a desired degree of output accuracy.

In an attempt to prove this conjecture, we began developing a generalization
of the Reals project’s asymptotic semantics [ORA87] that would allow
the  values  of  variables  to  be  either  nonstandard  real  numbers  or  intervals 
of  nonstandard  real  numbers,  and  allow  the  arithmetic  operations  on  these 
variables  to  denote  either  operations  on  nonstandard  reals  or  on  intervals  of 
nonstandard  reals.  During  this  effort,  however,  we  found  a  counterexample 
to  the  intent  of  our  conjecture. 

This  section  first  gives  an  informal  definition  of  the  notion  of  asymptotic 
correctness and an informal description of the interval version of it we
investigated. It then gives and explains the example we found showing that merely
reinterpreting scalar programs as interval ones and proving them intervally
asymptotically correct does not give useful information about the degree of
machine accuracy necessary to obtain a desired degree of output accuracy.

2.4.1 Asymptotic Correctness Definitions


Asymptotic correctness can be loosely defined in standard mathematical
terms as follows: a program is asymptotically correct if it always halts and
its outputs become arbitrarily accurate as it is run on a sequence of machines
whose sets of machine-representable numbers become progressively
larger and whose machine arithmetic operations become progressively more
accurate and progressively less vulnerable to overflow. On such a sequence of
machines, arbitrary real-number inputs can thus eventually be approximated
arbitrarily accurately, and arbitrary real-number calculations can eventually
be carried out arbitrarily accurately and without exceptions [ORA87].
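
The standard definition can be illustrated informally by running one fixed program on a sequence of simulated machines of increasing precision. In this sketch, which is an assumption-laden illustration rather than anything from [ORA87], each "machine" is a Python `decimal` context with a given number of significant digits, and the program sums the series for e until adding the next term no longer changes the total. The names `approx_e` and `error` are hypothetical.

```python
from decimal import Decimal, Context

def approx_e(digits):
    """Sum 1/k! on a simulated machine with `digits` significant digits."""
    ctx = Context(prec=digits)
    total = Decimal(1)
    term = Decimal(1)
    k = 1
    while True:
        term = ctx.divide(term, Decimal(k))
        new_total = ctx.add(total, term)
        if new_total == total:
            # The next term is too small to change the rounded sum: halt.
            return total
        total = new_total
        k += 1

def error(digits):
    """Distance of the machine result from a higher-precision reference for e."""
    wide = Context(prec=digits + 20)
    reference = Decimal(1).exp(wide)
    return wide.abs(wide.subtract(approx_e(digits), reference))
```

The program halts on every such machine, and `error(5)`, `error(10)` and `error(20)` shrink as the precision grows, which is the behaviour the definition asks for. A finite run like this only suggests asymptotic correctness and does not prove it.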

For the practical purpose of proving programs asymptotically correct,
though, it is more convenient to define asymptotic correctness in terms
of nonstandard models of analysis. With this definition, it is possible to
formally prove programs asymptotically correct from axioms that are only
slightly more complicated than typical axioms for the real numbers and
their usual operations and relations. The Reals project [ORA87] developed
this  approach  as  a  means  of  eliminating  the  most  common  bugs  in  numerical 
software. 

A  nonstandard  model  of  analysis  is  very  similar  to  the  real  numbers  with 
their usual arithmetic operations and equality and order relations, but it
contains infinitesimal numbers other than 0 which are smaller in magnitude than
any nonzero real number, and infinite numbers which are larger in magnitude
than any real number. The real numbers, or standard reals, with their usual
arithmetic operations and relations occur as a substructure of every
nonstandard model of analysis. A nonstandard real is called finite if it is bounded by
some standard real. Two nonstandard reals are infinitesimally close if their
difference is an infinitesimal. See introductory textbooks on nonstandard
analysis,  e.g.,  Hurd  [HL85],  for  more  information  about  nonstandard  models 
of  analysis. 

In the nonstandard model of analysis formulation of asymptotic correctness,
every standard real is assumed to be infinitesimally close to a
machine-representable value, every machine arithmetic operation on arguments that
are finite, except division by an infinitesimal, is assumed to produce a result
infinitesimally close to the result of the corresponding ideal operation, and
overflow is assumed to never occur for any operation whose result is finite.
The number of different states a program can assume is also assumed to be
bounded by a nonstandard integer, though this nonstandard integer can be
infinite. A program is then asymptotically correct if it always halts after a
nonstandard integer number of steps and its results are then infinitesimally close
to ideal values for the function or relation the program is specified to
compute. A program is asymptotically correct by the standard definition if and
only if it is asymptotically correct by the nonstandard definition [ORA87].

For a nonstandard analysis formulation of interval asymptotic correctness,
assume that every standard real is contained in an interval of infinitesimal
length with machine-representable endpoints, assume that every interval
operation on intervals with finite endpoints, except division by an interval
containing an infinitesimal, produces an interval whose endpoints differ only
infinitesimally from the endpoints of the interval produced by the corresponding
ideal interval operation, and assume that overflow never occurs on any
interval operation that produces an interval whose endpoints are finite. A
program is then intervally asymptotically correct if it halts after a nonstandard
integer number of steps and its results are then intervals of infinitesimal
length containing ideal values for the function or relation the program is
specified to compute.

2.4.2  Conjecture  Problem  Example 

The program given in Appendix G gives an example of the problem we found
with relating the original scalar version of asymptotic correctness to the
interval version just defined. This program computes $\pi$ with a power
series. Its variables high and low contain computed approximations to initial
segments of the power series. If they were computed exactly, high and low
would be definitely greater than and definitely less than $\pi$, respectively,
the value of high would decrease monotonically, and the value of low would
increase monotonically. The program runs until low is at least as large as
high, or the newly-computed value of high is not strictly less than high’s
previous value, or the newly-computed value of low is not strictly greater
than low’s previous value.
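
The loop just described can be sketched in C. This is not the Appendix G program, which is not reproduced in this section; it is a minimal illustration that assumes Machin's formula, pi = 16 arctan(1/5) - 4 arctan(1/239), as the power series. The names m, pow5, pow239, high and low are borrowed from the discussion, and the function name pi_bracket is hypothetical. In exact arithmetic, stopping each alternating arctangent series after a positive term overshoots it and stopping after a negative term undershoots it, which is what keeps high above and low below the limit.

```c
#include <assert.h>
#include <math.h>   /* INFINITY macro only; no libm functions are called */

/*
 * Hypothetical sketch: bracket pi between low and high using
 * pi = 16*atan(1/5) - 4*atan(1/239) and the alternating series
 * atan(1/q) = 1/q - 1/(3 q^3) + 1/(5 q^5) - ...
 * Returns the number of loop passes performed.
 */
int pi_bracket(double *low_out, double *high_out)
{
    double m = 1.0;                      /* current odd exponent */
    double pow5 = 5.0, pow239 = 239.0;   /* 5^m and 239^m */
    double a = 0.0, b = 0.0;             /* partial sums of the two series */
    double high = INFINITY, low = -INFINITY;
    int passes = 0;

    assert(low_out != 0 && high_out != 0);
    for (;;) {
        /* Adding a positive term overshoots each arctangent ... */
        double a_over = a + 1.0 / (m * pow5);
        double b_over = b + 1.0 / (m * pow239);
        m += 2.0; pow5 *= 25.0; pow239 *= 57121.0;   /* 57121 = 239^2 */
        /* ... and subtracting the next term undershoots it. */
        a = a_over - 1.0 / (m * pow5);
        b = b_over - 1.0 / (m * pow239);
        m += 2.0; pow5 *= 25.0; pow239 *= 57121.0;

        double new_high = 16.0 * a_over - 4.0 * b;
        double new_low = 16.0 * a - 4.0 * b_over;
        /* Stop when the bracket closes or either bound stops improving. */
        int stop = new_low >= new_high || !(new_high < high) || !(new_low > low);
        high = new_high;
        low = new_low;
        ++passes;
        if (stop)
            break;
    }
    *low_out = low;
    *high_out = high;
    return passes;
}
```

In this sketch the terms fall below the rounding unit after a handful of passes, so the loop normally stops because a bound fails to improve. If pow5 or pow239 did overflow to infinity, the corresponding quotient would become 0 and the same stopping test would apply.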

With a formalization of IEEE arithmetic in a nonstandard model of analysis,
this program is provably asymptotically correct. Such a proof shows that,
after a nonstandard (infinite) integer number of steps, either one of the
variables m, pow5 or pow239 overflows to $+\infty$, or the accumulated errors
in computation cause low to equal or exceed high. In all cases, since division
by $+\infty$ gives 0 as its result and causes values of high and low to be
equal or unchanged, the program’s loop termination condition is met and the
program halts with both high and low infinitesimally close to $\pi$.

Note  that  the  possibilities  that  arise  in  the  nonstandard  proof  of  asymp¬ 
totic  correctness  reflect  possibilities  on  real  machines.  On  most  machines 
with IEEE arithmetic, for example, the variable pow239 will overflow to $+\infty$
before  the  loop  terminates,  and  that  will  cause  computed  values  for  high  and 
low  to  be  equal.  On  machines  without  IEEE  arithmetic,  the  program  has  to 
be  rewritten  so  that  the  loop  terminates  if  either  its  termination  condition  is 
met  or  if  overflow  occurs. 

If  the  program  were  reinterpreted  to  make  the  values  of  the  variables  inter¬ 
vals  of  nonstandard  real  numbers,  and  to  make  machine  arithmetic  operations 
the  corresponding  machine  operations  on  these  intervals,  the  program  would 
terminate  when  the  computed  intervals  for  high  and  low  intersected,  when 
a  newly-computed  interval  for  high  intersected  the  previously-computed  in¬ 
terval  for  high,  or  when  a  newly-computed  interval  for  low  intersected  the 
previously-computed  interval  for  low.  With  the  interval  interpretation,  the 
program  could  be  proved  to  terminate  after  going  through  its  loop  a  nonstan¬ 
dard  (infinite)  integer  number  of  times,  with  both  of  the  intervals  assigned  to 
the  variables  high  and  low  being  of  infinitesimal  length  and  both  containing 
points infinitesimally close to $\pi$.

The  critical  problem  with  the  proof  of  interval  asymptotic  correctness 
would  be  that  it  would  not  determine  which,  if  either,  of  the  two  intervals 
assigned to the variables high and low actually contains $\pi$. If the program
terminated because one or both of the computed intervals for high and low
stopped strictly increasing or decreasing, then these two intervals might be
separated by an infinitesimally long gap containing $\pi$. For the definition of
“interval  asymptotic  correctness”  given  above,  this  interval  version  of  the 
program  would  not  be  asymptotically  correct. 

There are, of course, programs to compute intervals containing $\pi$ that
could, in the asymptotic case, be proved to compute intervals of infinitesimal
length containing $\pi$. It is also true that the program given in Appendix G
is effectively asymptotically correct. The example shows, though, that a
correspondence between effective asymptotic correctness and interval asymptotic
correctness  cannot  be  given,  as  we  had  hoped,  by  a  simple  reinterpretation  of 
values  and  operations.  Programs  that  are  intervally  asymptotically  correct 
are  likely  to  contain  interval  operations  such  as  union  and  intersection  that 
do  not  correspond  to  any  scalar  operations. 


2.5  Matijasevich’s  Method 

Y.  Matijasevich  [Mat85]  has  suggested  a  method  for  calculating  rigorous 
bounds  on  the  uncertainties  in  computed  outputs  that  arise  from  uncertain¬ 
ties  in  inputs  and  approximate  calculations.  This  method,  which  is  essentially 
an efficient way of computing numerical bounds on partial derivatives,
addresses the problem that there are correlations between the errors in different
computed  quantities  that  make  the  bounds  given  by  interval  arithmetic  far 
too  conservative.  His  method  has  serious  limitations,  but  is  worthy  of  further 
study. 

To obtain Matijasevich’s idea in its simplest form, first assume that the
program $P_0$ takes $m$ values $x_0, \ldots, x_{m-1}$ as inputs, computes
distinct values $x_m, \ldots, x_{n-1}$ as intermediate results, computes these
intermediate results in straight-line code that uses only binary arithmetical
operations, and returns $y = x_{n-1}$ as its final result. Assume that all
programs are written in a C-like language, so = denotes assignment, += denotes
incrementing the quantity on the left by the quantity on the right, and -=
denotes decrementing the quantity on the left by the quantity on the right. To
define notation, assume the lines of $P_0$ computing intermediate results are
of the form

    $x_i = x_{g(i)} *_i x_{h(i)};$

for $m \le i < n$, where $*_i \in \{+, -, \cdot, /\}$. The functions $g$ and
$h$ identify the left and right arguments, respectively, of the operations
giving the intermediate results. For the moment, assume that all machine
operations are exact, and ignore program constants.

Matijasevich points out that ordinary interval arithmetic is equivalent
to replacing $P_0$ with a new program $P_1$ that takes not only the $m$ values
$x_0, \ldots, x_{m-1}$, but also $m$ error-bound values $e_0, \ldots, e_{m-1}$,
as inputs, and computes for each $m \le i < n$ an error bound $e_i$ such that
$x_i$ varies by no more than $e_i$ from its computed value if every $x_j$,
$0 \le j < m$, varies by no more than $e_j$ from its input value. The program
$P_1$ can be obtained from $P_0$ by inserting code to compute the necessary
error bound before each line computing one of the intermediate results. If
$*_i \in \{+, -\}$, for example, the code inserted is

    $e_i = e_{g(i)} + e_{h(i)};$

if $*_i = \cdot$ the code inserted is

    $e_i = e_{g(i)} \cdot |x_{h(i)}| + e_{h(i)} \cdot |x_{g(i)}| + e_{g(i)} \cdot e_{h(i)};$

and if $*_i = /$ the code inserted is

    if ($|x_{h(i)}| \le e_{h(i)}$)
        $e_i = +\infty;$
    else
        $e_i = (e_{g(i)} \cdot |x_{h(i)}| + e_{h(i)} \cdot |x_{g(i)}|) / (|x_{h(i)}| \cdot (|x_{h(i)}| - e_{h(i)}));$

(This presentation is slightly simpler than Matijasevich’s and assumes the
presence of the IEEE-arithmetic value $+\infty$.)
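
The forward error-bound pass can be sketched in C as follows. The struct step encoding of a program line, the helper absd and the function name forward_bounds are hypothetical, and the quotient bound is the one obtained by bounding |x_g'/x_h' - x_g/x_h| directly, which may differ in form from Matijasevich's. Machine operations are treated as exact here; a rigorous version would round every bound upward.

```c
#include <assert.h>
#include <math.h>   /* INFINITY macro only; no libm functions are called */

/* Hypothetical encoding of one line x[i] = x[g] op x[h] of the program. */
struct step { char op; int g, h; };

static double absd(double v) { return v < 0.0 ? -v : v; }

/*
 * Forward pass: for i = m..n-1 compute x[i] and an error bound e[i] from
 * the inputs x[0..m-1] and their bounds e[0..m-1].  steps[i - m] describes
 * line i.
 */
void forward_bounds(const struct step *steps, int m, int n, double *x, double *e)
{
    assert(0 < m && m <= n);
    for (int i = m; i < n; ++i) {
        const struct step s = steps[i - m];
        assert(0 <= s.g && s.g < i && 0 <= s.h && s.h < i);
        double xg = x[s.g], xh = x[s.h], eg = e[s.g], eh = e[s.h];
        switch (s.op) {
        case '+': x[i] = xg + xh; e[i] = eg + eh; break;
        case '-': x[i] = xg - xh; e[i] = eg + eh; break;
        case '*':
            x[i] = xg * xh;
            e[i] = eg * absd(xh) + eh * absd(xg) + eg * eh;
            break;
        case '/':
            x[i] = xg / xh;
            if (absd(xh) <= eh)
                e[i] = INFINITY;   /* the divisor's range contains 0 */
            else
                e[i] = (eg * absd(xh) + eh * absd(xg))
                     / (absd(xh) * (absd(xh) - eh));
            break;
        default:
            assert(0 && "unknown operation");
        }
    }
}
```

For the two-line program x2 = x0 + x1; x3 = x2 - x0 with e0 = 0.5 and e1 = 0.25, this pass reports e3 = 1.25 even though x3 equals x1 and so varies by at most 0.25. That is the kind of correlation between errors that makes plain interval bounds too conservative.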

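Matijasevich next gives an efficient method for calculating the partial
derivatives $\partial y/\partial x_j$ for $0 \le j < m$ at the point
$(x_0, \ldots, x_{m-1})$ and bounding them as each of the $x_j$ varies by no
more than $e_j$. To calculate the partial derivatives and bound them, form the
program $P_2$ as follows: First append to $P_1$ the lines

    $z_0 = 0;$
    $d_0 = 0;$
    ...
    $z_{n-2} = 0;$
    $d_{n-2} = 0;$
    $z_{n-1} = 1;$
    $d_{n-1} = 0;$

Then as $i$ varies from $n - 1$ downward to 0, if $*_i = +$ append the lines

    $z_{g(i)}$ += $z_i$;
    $d_{g(i)}$ += $d_i$;
    $z_{h(i)}$ += $z_i$;
    $d_{h(i)}$ += $d_i$;

if $*_i = -$ append the lines

    $z_{g(i)}$ += $z_i$;
    $d_{g(i)}$ += $d_i$;
    $z_{h(i)}$ -= $z_i$;
    $d_{h(i)}$ += $d_i$;

if $*_i = \cdot$ append the lines

    $z_{g(i)}$ += $z_i \cdot x_{h(i)}$;
    $d_{g(i)}$ += $|z_i| \cdot e_{h(i)} + d_i \cdot |x_{h(i)}| + d_i \cdot e_{h(i)}$;
    $z_{h(i)}$ += $z_i \cdot x_{g(i)}$;
    $d_{h(i)}$ += $|z_i| \cdot e_{g(i)} + d_i \cdot |x_{g(i)}| + d_i \cdot e_{g(i)}$;

and if $*_i = /$ append the lines

    $z_{g(i)}$ += $z_i / x_{h(i)}$;
    $d_{g(i)}$ += $\dfrac{d_i + |z_i / x_{h(i)}| \cdot e_{h(i)}}{|x_{h(i)}| - e_{h(i)}}$;
    $z_{h(i)}$ -= $z_i \cdot x_{g(i)} / x_{h(i)}^2$;
    $d_{h(i)}$ += $\dfrac{|z_i| \cdot e_{g(i)} + d_i \cdot |x_{g(i)}| + d_i \cdot e_{g(i)}}{(|x_{h(i)}| - e_{h(i)})^2} + \dfrac{|z_i| \cdot |x_{g(i)}| \cdot (2 \cdot |x_{h(i)}| \cdot e_{h(i)} + e_{h(i)}^2)}{x_{h(i)}^2 \cdot (|x_{h(i)}| - e_{h(i)})^2}$;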


The computed value of $z_i$ can be roughly described as $\partial y/\partial x_i$
evaluated at the point $(x_0, \ldots, x_{m-1})$ for $y$ expressed in terms of
the $x$’s that have been considered so far. When program $P_2$ terminates,
$z_j = \partial y/\partial x_j$ evaluated at $(x_0, \ldots, x_{m-1})$ for all
$0 \le j < m$. To name the region of points “near” the point let

    $$B = [x_0 - e_0, x_0 + e_0] \times \cdots \times [x_{m-1} - e_{m-1}, x_{m-1} + e_{m-1}].$$

The values $d_i$ are such that if $\partial y/\partial x_i$ is evaluated at
different points in $B$, the value of $\partial y/\partial x_i$ never differs
from $z_i$ by more than $d_i$. It is then true that for every point
$(x'_0, \ldots, x'_{m-1}) \in B$,

    $$-\sum_{j=0}^{m-1} (|z_j| + d_j) \cdot e_j \le y(x'_0, \ldots, x'_{m-1}) - y(x_0, \ldots, x_{m-1}) \le \sum_{j=0}^{m-1} (|z_j| + d_j) \cdot e_j.$$

Further, the error bound

    $$e = \sum_{j=0}^{m-1} (|z_j| + d_j) \cdot e_j$$

is optimal in the sense that the ratio of the largest difference between the
value of $y$ at $(x_0, \ldots, x_{m-1})$ and at another point in $B$ to $e$
tends to 1 as all of the $e_j$ tend to 0. The technique thus eliminates the
excess conservatism in ordinary interval arithmetic.
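
The backward pass and the final bound can be sketched in C under the same assumptions as the text (exact machine operations, no program constants). The struct step encoding and the names absd and posteriori_bound are hypothetical. The arrays x and e are assumed to have already been filled for all n lines by the forward pass, with every e[i] finite so that each divisor's range excludes 0. The division updates are bounds derived directly for this sketch and may differ in form from Matijasevich's own.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical encoding of one line x[i] = x[g] op x[h] of the program. */
struct step { char op; int g, h; };

static double absd(double v) { return v < 0.0 ? -v : v; }

/*
 * Backward pass: z[i] accumulates dy/dx_i at the computed point and d[i]
 * bounds how far that derivative can move over the box B.  Returns
 * e = sum over the m inputs of (|z_j| + d_j) * e_j.
 */
double posteriori_bound(const struct step *steps, int m, int n,
                        const double *x, const double *e)
{
    assert(0 < m && m <= n);
    double *z = calloc((size_t)n, sizeof *z);
    double *d = calloc((size_t)n, sizeof *d);
    assert(z != 0 && d != 0);
    z[n - 1] = 1.0;                      /* y = x[n-1], so dy/dy = 1 */

    for (int i = n - 1; i >= m; --i) {   /* only lines m..n-1 have an operation */
        int g = steps[i - m].g, h = steps[i - m].h;
        double zi = z[i], di = d[i], az = absd(zi);
        switch (steps[i - m].op) {
        case '+': z[g] += zi; d[g] += di; z[h] += zi; d[h] += di; break;
        case '-': z[g] += zi; d[g] += di; z[h] -= zi; d[h] += di; break;
        case '*':
            z[g] += zi * x[h];
            d[g] += az * e[h] + di * absd(x[h]) + di * e[h];
            z[h] += zi * x[g];
            d[h] += az * e[g] + di * absd(x[g]) + di * e[g];
            break;
        case '/': {
            double r = absd(x[h]) - e[h];   /* smallest |divisor| over B */
            assert(r > 0.0);
            z[g] += zi / x[h];
            d[g] += (di + absd(zi / x[h]) * e[h]) / r;
            z[h] -= zi * x[g] / (x[h] * x[h]);
            d[h] += (az * e[g] + di * absd(x[g]) + di * e[g]) / (r * r)
                  + az * absd(x[g]) * (2.0 * absd(x[h]) * e[h] + e[h] * e[h])
                    / (x[h] * x[h] * r * r);
            break;
        }
        default:
            assert(0 && "unknown operation");
        }
    }

    double bound = 0.0;
    for (int j = 0; j < m; ++j)
        bound += (absd(z[j]) + d[j]) * e[j];
    free(z);
    free(d);
    return bound;
}
```

For x2 = x0 + x1; x3 = x2 - x0 with e0 = 0.5 and e1 = 0.25, the contributions to z[0] cancel and the returned bound is 0.25, where the forward interval bound is 1.25. For a single product x0 * x1 with x0 = 2, x1 = 3 and the same error bounds, the result is 2.25, slightly above the forward bound of 2.125; the optimality statement concerns the limit as the input bounds tend to 0.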

To deal with error in machine operations, use directed roundings to compute
conservative upper bounds in all the calculations of the $e_i$ and $d_i$, and
use interval arithmetic in all the computations of the $x_i$ and $z_i$.
Constants in programs can be treated as additional inputs; a constant $x_i$
that is machine-representable can be dealt with more simply than the true
inputs because $e_i$, $d_i$ and $z_i$ are all 0.

Matijasevich’s technique requires a significant amount of additional
computation, but this amount is only proportional to the length of the original
program. The final program $P_2$ is only a constant multiple longer than the
initial program $P_0$. By contrast, computing the partial derivatives
$\partial x_i/\partial x_j$ at $(x_0, \ldots, x_{m-1})$ for each $i$ and every
$0 \le j < m$, as was suggested by Hansen [Han75], takes time proportional to
the original program’s length times $m$. Computing the partial derivatives by
working backwards, by looking at successive decreasing values of $i$, produces
the necessary results much faster; that is why Matijasevich called his
technique an “a posteriori” version of interval analysis.

The most significant problem with Matijasevich’s technique is that it is
difficult  to  apply  it  to  programs  with  loops.  If  a  program  ran  on  a  machine 
that  stored  all  its  intermediate  results  and  also  stored  a  record  of  which 
operations  it  performed,  one  could  carry  out  his  technique  to  produce  a  new 
program  computing  the  final  error  bound  e  as  before,  but  the  space  needed  to 
store  this  new  program  would  no  longer  be  a  constant  multiple  of  the  space 
needed  to  store  the  original  program. 

Matijasevich  claims  that  for  some  mathematical  problems,  particularly 
calculating  determinants  by  putting  matrices  into  triangular  form,  programs 
exist  that  compute  the  error  bound  e  by  a  similar  technique  that  require 
storage  space  on  the  order  of  that  required  by  the  original  program.  He  does 
not  give  references,  though,  and  suggests  that  even  when  such  programs  exist 
they  cannot  be  produced  by  a  mechanical  transformation.  His  technique  is 
worthy  of  further  attention  because  a  mechanizable  means  might  be  found  for 
taking advantage of a posteriori calculation of partial derivatives and bounds
on these derivatives.

Chapter 3

Floating-Point  Arithmetic 


We  noted  in  our  interim  report  [ORA88]  that  one  of  the  “alternatives”  to 
floating-point  arithmetic  that  should  be  considered  was  floating-point  itself. 
We  suggested  that  sufficiently  precise  versions  of  floating-point  arithmetic, 
preferably ones meeting the IEEE standard, might suffice to make the asymptotic
model realistic. We expected that such versions would have the single
serious  drawback  that  their  operations  would  be  slower,  and  expected  that 
they  could  be  implemented  so  that  this  slowing  would  be  minimal. 

We  promised  to  investigate  the  costs  and  benefits  of  different  versions  of 
floating-point  arithmetic,  to  perform  a  specific  implementation  to  compare 
IEEE  and  VAX  arithmetic,  to  investigate  the  theoretical  speed  limits  of  IEEE 
and  other  forms  of  floating-point  arithmetic,  and  to  look  into  whether  alter¬ 
nate  forms  of  floating-point  arithmetic  might  facilitate  parallel  computation. 
This  chapter  comments  on  the  realism  of  the  asymptotic  model,  discusses 
the  costs  of  precision  in  floating-point  arithmetic,  and  gives  empirical  re¬ 
sults  comparing  IEEE  and  VAX  arithmetic.  It  also  describes  a  version  of 
floating-point  arithmetic  that  facilitates  parallel  computation. 

3.1  Realism  of  the  Asymptotic  Model 

We have informally defined the notion of asymptotic correctness [ORA87]
and its formulation in nonstandard models of analysis [HL85] in Section 2.4.


We  take  the  asymptotic  model  of  computation  to  be  the  one  that  assumes 
valid  machine  operations  on  finite  quantities  produce  results  that  differ  only 
infinitesimally  from  their  ideal  values.  Although  the  Reals  project  developed 
the asymptotic model as an idealization that could be used to eliminate most
bugs  in  numerical  software,  it  is  obviously  a  formalization  of  an  “extremely 
precise”  form  of  floating-point  arithmetic,  and  is  at  least  partly  realistic  for 
every  accurate  implementation  of  floating-point  arithmetic. 

As  we  noted  in  our  interim  report,  a  version  of  floating-point  arithmetic 
with 256-bit mantissas and correctly rounded operations would be unlikely
to  accumulate  computational  errors  large  enough  to  be  significant  parts  of 
measurable  quantities  in  any  calculation  that  could  be  carried  out  in  a  real¬ 
istic  time.  In  such  a  floating-point  arithmetic,  the  computational  errors  do 
behave  like  the  infinitesimals  in  nonstandard  models  of  analysis.  We  set  out 
to  determine  how  much  smaller  the  mantissas  could  be  to  give  floating-point 
values  that  would  still  have  this  property. 

We  found,  however,  that  there  are  situations  in  which  no  reasonable 
amount  of  precision  in  floating-point  arithmetic  suffices  to  make  the  asymp¬ 
totic  model  accurate.  The  fragment  of  C  code 

    x = 2.0;
    for (i = 0; i < 10000; ++i)
        x = sqrt(x);
    for (i = 0; i < 10000; ++i)
        x = x*x;
    y = x;

for example, assigns y a value infinitesimally close to 2.0 in the asymptotic
model,  but  on  an  accurate  computer  with  fewer  than  thousands  of  bits  in 
its  floating-point  values  assigns  y  the  value  1.0.  Still,  such  examples  are 
extremely  unrealistic,  so  sufficiently  precise  floating-point  values  can  be  ex¬ 
pected  to  make  the  asymptotic  model  realistic  for  most  code. 

The  extended-precision  values  in  the  IEEE  standard  were  chosen  to  have 
a range so large that “over/underflows, which mostly occur during intermediate
calculations, would almost disappear”, and to be precise enough that
“for many calculations, rounding errors [are] really negligible” [KP79].
Ordinary double-precision values are also often considered to have this
precision property [Knu81]. This can be construed as saying that IEEE
extended-precision values, or even ordinary double-precision values, are thought of as
making  the  asymptotic  model  sufficiently  accurate  for  most  applications;  the 
asymptotic  model  gives  a  precise  interpretation  of  “really  negligible”. 


3.2  Costs  of  Accuracy 

Although  we  emphasized  time  as  the  critical  cost  in  our  interim  report,  the 
time  needed  to  perform  operations  is  not  necessarily  the  most  significant  cost 
of  using  more  precise  floating-point  values.  The  space  needed  to  store  these 
values  was  considered  so  expensive  by  those  choosing  the  IEEE  standard 
that  extended-precision  values  were  intended  to  be  used  only  as  intermediate 
values  [KP79]. 

More  importantly,  even  if  fast  algorithms  theoretically  exist  for  perform¬ 
ing  operations  on  larger  floating-point  numbers  with  little  loss  of  speed  (e.g., 
algorithms  using  Wallace  [Wal64]  or  Dadda  [Dad65]  trees  that  perform  mul¬ 
tiplications  in  times  proportional  to  the  logarithm  of  the  number  of  bits), 
the  size  of  the  necessary  integrated  circuits  can  be  prohibitive.  “Silicon  real 
estate”  is  very  expensive,  so  much  of  the  literature  on  hardware  design  eval¬ 
uates  area/time  tradeoffs  [BPTP87,Fan87,HC87,PSG87,Sha87]. 

Further,  fast  algorithms  can  require  more  complicated  integrated  circuits 
that  are  more  difficult  to  manufacture  and  have  a  lower  yield  of  nondefective 
circuits  than  do  slower  algorithms.  Simplicity  is  enough  of  an  advantage  in  in¬ 
tegrated  circuit  manufacture  that  there  is  engineering  interest  in  theoretically 
slower  algorithms  with  simpler  hardware  implementations  [PSG87,Sha87]. 

Finally,  although  there  are  fast  algorithms  for  addition,  subtraction  and 
multiplication  that  can  be  modified  to  produce  correctly-rounded  results  con¬ 
sistently  with  the  IEEE  standard,  the  fastest  algorithm  ordinarily  suggested 
for  division,  an  iterative  one  [Knu81],  does  not  produce  correctly-rounded  re¬ 
sults  and  so  cannot  be  modified  to  fit  the  IEEE  standard  (see  e.g.,  [Fan87]). 

The  iterative  division  algorithm  takes  time  that  is  proportional  to  log  n, 
where  n  is  the  number  of  bits  in  the  mantissas  of  the  numbers  being  divided. 
Although  we  did  not  locate  proofs  that  division  algorithms  compatible  with 
the IEEE standard which require only time proportional to log n do not
exist, that is suggested by the engineering literature, which contains
descriptions of slower “fast” division algorithms compatible with the IEEE standard
[BPTP87,Fan87,PSG87].  These  algorithms  are  similar  to  ordinary  long  divi¬ 
sion  but  use  redundant  digit-sets  (described  briefly  in  Section  3.4  below)  and 
nonrestoring  division  to  simplify  the  multiplication  of  the  divisor  by  the  next 
digit  in  the  result  and  to  avoid  having  to  “undo”  a  subtraction  if  the  “guess” 
as to the next digit was too high [Atk68]. While these algorithms are much
faster  than  other  division  algorithms,  they  still  require  time  proportional  to 
n. 

The  issue  of  whether  the  IEEE  standard  forces  division  algorithms  to  take 
times  greater  than  constant  multiples  of  the  time  required  by  the  iterative 
algorithm  did  not  arise  in  debates  over  adoption  of  this  standard;  see  [Cod79, 
Fel79,FW79,KP79,PS79] and [Cod81,Coo81,Dem81,Hou81,TP81]. For practical
purposes, large numbers of bits are not considered for operations that
are to be implemented in hardware, particularly hardware constructed in
accordance  with  the  IEEE  standard. 


3.3  IEEE  and  VAX  Arithmetic 

We  carried  out  the  implementation  of  scalar  and  interval  versions,  for  IEEE 
and  VAX  arithmetic,  of  programs  to  compute  products  with  Fast  Fourier 
Transforms.  The  code  for  these  programs  is  in  Appendices  B  and  C,  and  an 
edited  version  of  their  output  is  in  Appendix  D.  These  programs  and  their 
output were described in Section 2.3. This section describes characteristics
of  their  output  that  are  significant  for  comparing  IEEE  and  VAX  arithmetic. 

Comparing  IEEE  arithmetic  and  VAX  arithmetic  is  essentially  a  matter 
of  comparing  more  logical  rounding  with  greater  precision.  In  comparing 
roundings,  the  results  of  both  IEEE  and  VAX  arithmetic,  for  IEEE  arithmetic 
in its default round-to-nearest mode, are typically produced as if these results
were first computed exactly, then rounded to the nearest representable value.
If an exact result is equally near two representable values, though, VAX
arithmetic rounds it to the one with larger magnitude while IEEE arithmetic
rounds it to the one whose least significant bit is 0. Rounding errors are thus
more likely to accumulate in VAX than in IEEE arithmetic.

In comparing precisions, both VAX and IEEE double-precision floating-point
values occupy 64 bits. On the VAX [Dig81], 55 of these bits code the
mantissa and 8 of them code the exponent; in IEEE arithmetic [IEE185], 52
bits code the significand and 11 code the exponent. (The IEEE standard
uses “significand” instead of “mantissa” because “significands” are defined
for things such as infinite and “Not a Number” values for which “mantissas”
are not defined.) In both VAX and IEEE double-precision values, the
mantissa/significand is always normalized and its leading 1 is not explicitly
recorded, so the two versions actually have 56 and 53 bits of precision,
respectively.  IEEE  arithmetic  thus  sacrifices  precision  for  moderately-sized 
numbers  in  order  to  handle  numbers  of  more  varying  magnitudes. 

The  interval  results  show  that  even  with  a  much  cruder  algorithm  for 
determining  interval  endpoints  the  VAX  results  were  better  than  the  IEEE 
results  in  all  except  one  case  in  which  the  IEEE  result  was  not  a  point 
interval.  (The  VAX  endpoint  algorithm  was  too  crude  to  give  VAX  arithmetic 
a chance to produce point intervals.) The scalar results were more evenly
matched;  in  three  of  the  eight  cases  in  which  both  results  were  not  perfect, 
the  IEEE  results  were  better.  These  results  indicate  that  having  control  of 
rounding  modes  is  not  in  itself  worth  three  additional  bits  of  precision,  but 
that  avoiding  rounding  errors  that  tend  to  accumulate  is  almost  worth  those 
three  bits. 
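
The difference between the two tie-breaking rules can be illustrated with a small integer model in C. This is only a sketch: it rounds non-negative values of the form twice/2 to integers rather than modelling either floating-point format, and the function names are hypothetical. It shows why ties rounded toward larger magnitude drift in one direction while ties rounded to an even neighbor tend to cancel.

```c
#include <assert.h>

/* Round the non-negative value twice/2 to an integer, ties away from zero
 * (the tie rule the text attributes to VAX arithmetic). */
long round_half_away(long twice)
{
    assert(twice >= 0);
    return (twice + 1) / 2;
}

/* Same value, ties to the neighbor whose least significant bit is 0
 * (the IEEE round-to-nearest tie rule). */
long round_half_even(long twice)
{
    assert(twice >= 0);
    long floor_half = twice / 2;
    if (twice % 2 == 0)
        return floor_half;                  /* exactly representable */
    return floor_half + (floor_half & 1);   /* tie: pick the even neighbor */
}

/* Sum of (rounded - exact) over the ties 0.5, 1.5, ..., count - 0.5,
 * measured in units of 1/2. */
long tie_drift_in_halves(long count, long (*round_fn)(long))
{
    long drift = 0;
    assert(count >= 0);
    for (long k = 0; k < count; ++k) {
        long twice = 2 * k + 1;             /* the exact value is twice/2 */
        drift += 2 * round_fn(twice) - twice;
    }
    return drift;
}
```

Over the four ties 0.5, 1.5, 2.5 and 3.5, rounding away from zero accumulates a drift of four half-units, while rounding to even accumulates none.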


3.4 Floating-Point and Parallel Processing

We did find information in the engineering literature on alternative
representations of floating-point values that facilitate performing many
different operations in parallel. The basic idea, which is also used in the
lexicographically-coded continued fraction version of finite rational
arithmetic to be described in Section 5.2, is to do on-line arithmetic.

In on-line arithmetic, each argument to an operation is represented as a
sequence of generalized digits that could be bits, ordinary digits, negative
digits, or so forth. The algorithm performing the operation inputs successive
digits from its inputs and produces a sequence of similar successive digits as
its output.

Some definitions of on-line arithmetic (e.g., [TE77]) also impose the
additional restrictions that the values are represented so that their most
significant digits occur first and that each operation produces the $j$th
digit of its output after taking in at most $j + \delta$ digits of its inputs,
where $\delta$ is a positive integer called the operation’s delay. With the
more restricted definition, on-line operations can perform variable-precision
arithmetic and produce more or fewer result digits as desired; there is no
problem of having a potentially infinite wait for the next digit.

With on-line arithmetic, performing many computations in parallel is
simple, and there is no need to solve potentially-intractable problems of
making sure the results of one computation are available before they are used
in another. The expression

    $$z = c_0 + x \cdot (c_1 + x \cdot (c_2 + x \cdot c_3)),$$

can be evaluated as

    $$z = S_0(c_0, P_0(x, S_1(c_1, P_1(x, S_2(c_2, P_2(x, c_3)))))),$$

where $S_0$, $S_1$ and $S_2$ denote circuits producing sums and $P_0$, $P_1$
and $P_2$ denote circuits producing products. The digits of $x$ can be
duplicated and sent to $P_0$, $P_1$ and $P_2$ simultaneously. As soon as $P_2$
begins to produce result digits, $S_2$ can begin combining them with digits
from $c_2$ while $P_2$ continues taking in digits from $x$ and $c_3$. As soon
as $S_2$ begins producing result digits $P_1$ can start operating while $S_2$
and $P_2$ continue operating, and as soon as $P_1$ begins producing result
digits $S_1$ can begin operating while $P_1$, $S_2$ and $P_2$ continue
operating. Eventually, parts of all the necessary operations can be performed
simultaneously. The more complicated the evaluation of an expression becomes,
the greater is the possibility for performing many operations in parallel.

Ercegovac  and  Watanuki  [WE81]  have  developed  an  on-line  version,  using 
the  more  restricted  definition  of  “on-line”,  of  floating-point  arithmetic.  Their 
version  of  floating-point  uses  a  maximally  redundant  set  of  digits  to  code  the 
mantissas. For numbers to a base $b$, the maximally redundant set of digits is

    $$\{-(b-1), \ldots, -1, 0, 1, \ldots, (b-1)\}.$$

As notation for negative digits, let the negative of an ordinary digit be given
by putting a bar over the digit, so $\bar{3} = -3$, $\bar{9} = -9$, and so on.
The nonzero numbers with redundant digit-sets are otherwise interpreted as if
they were ordinary base-$b$ floating-point numbers whose mantissas have
magnitudes between 0 and 1. For $b = 10$, for example, the value with mantissa
$1\bar{2}5$ and exponent 2 denotes the number
$1 \cdot 10 + (-2) \cdot 1 + 5 \cdot 0.1 = 8.5$.
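
The interpretation of a signed-digit mantissa can be checked with a few lines of C. This is a minimal sketch for base 10 only; the function name is hypothetical, and the mantissa is returned as an integer so that the check involves no rounding.

```c
#include <assert.h>

/*
 * Value of a signed-digit base-10 mantissa d[0] d[1] ... d[n-1] (most
 * significant digit first, each digit in -9..9), returned as the integer
 * d[0]*10^(n-1) + ... + d[n-1].  The number denoted with exponent e is
 * this integer times 10^(e - n).
 */
long sd_mantissa_to_integer(const int *d, int n)
{
    long value = 0;
    assert(n >= 0);
    for (int k = 0; k < n; ++k) {
        assert(-9 <= d[k] && d[k] <= 9);
        value = 10 * value + d[k];   /* Horner's rule; digits may be negative */
    }
    return value;
}
```

For the digits 1, -2, 5 the function returns 85, and with exponent 2 the number denoted is 85 times 10^(2 - 3) = 8.5. Because the digit set is redundant, different digit strings can denote the same number: 3, 3, 0, 0, 0, -2 and 3, 2, 9, 9, 9, 8 both give 329998.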

Ercegovac  and  Watanuki’s  system  treats  the  exponent  of  a  floating-point 
value  as  a  single  digit,  and  each  operation  determines  the  exponent  and  first 
digit  of  its  result  after  reading  the  exponents  and  first  digits  of  its  arguments. 
Their  on-line  floating-point  arithmetic  operations  are  then  essentially  given 
by corresponding on-line arithmetic operations for the mantissas. Addition
and subtraction are easy (we will briefly describe base-10 addition to show
how the redundant digit set is used), and multiplication and division
algorithms are given in [TE77].

To  add,  take  the  exponent  of  the  result  to  be  the  larger  of  the  exponents 
of  the  arguments,  and  shift  the  mantissa  of  the  argument  with  the  smaller 
exponent  to  the  right,  filling  in  a  leading  0  and  incrementing  that  argument’s 
exponent  with  each  shift,  until  the  exponents  are  equal.  Next  output  the 
successive digits of the result’s mantissa such that:

1. The first $j$ (or $j + 1$ if there is an initial carry) digits of the
result’s mantissa equal the sum of the first $j$ digits of the properly-aligned
arguments’ mantissas; and

2. The last (the $j$th or $(j + 1)$st) digit of the result’s mantissa is not 9
or $\bar{9}$.

The condition about the last digit can be achieved since $9 = 1\bar{1}$ and
$\bar{9} = \bar{1}1$, and any carry cannot affect more than the next-to-last
digit, because by induction the next-to-last digit was not 9 or $\bar{9}$. It
is thus possible to add the mantissas with a delay of at most 1.
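
Most-significant-digit-first addition with a delay of one digit can be sketched in C as follows. The sketch uses a common signed-digit rule, in which a transfer of -1, 0 or +1 is chosen at each position so that the digit still pending stays within -8..8; this is one way of keeping the last digit away from 9 and its negative, and is not necessarily the exact procedure of [WE81]. The function names are hypothetical, and the mantissas are assumed to be already aligned.

```c
#include <assert.h>

/*
 * Hypothetical sketch: add two aligned signed-digit base-10 mantissas x and
 * y (n digits each, most significant first, digits in -9..9).  Writes n + 1
 * digits to out; out[0] is the initial carry digit.  out[k] is fixed as soon
 * as input digits 0..k have been read, so the delay is one digit.
 */
void sd_add(const int *x, const int *y, int n, int *out)
{
    int pending = 0;   /* interim digit awaiting a transfer; always in -8..8 */
    assert(n >= 0);
    for (int k = 0; k < n; ++k) {
        int p = x[k] + y[k];                    /* in -18..18 */
        int transfer = (p >= 9) - (p <= -9);    /* -1, 0 or +1 */
        out[k] = pending + transfer;            /* final digit, in -9..9 */
        pending = p - 10 * transfer;            /* back in -8..8 */
    }
    out[n] = pending;
}

/* Integer value of an n-digit signed-digit string, for checking results. */
long sd_to_integer(const int *d, int n)
{
    long value = 0;
    for (int k = 0; k < n; ++k)
        value = 10 * value + d[k];
    return value;
}
```

Adding the digit strings 9, 9, 9 and 0, 0, 1 yields 1, 0, 0, 0. No carry ever travels more than one position, which is what allows each result digit to be emitted before the remaining input digits are known.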

Redundant digit sets essentially get their power because it is not necessary
to determine each successive result digit exactly. It is sufficient to output a
result digit that is close to the value it would have in a nonredundant system,
then correct it later if necessary. In ordinary base-10, for example, the third
digit of a number whose partial computation begins 0.32999... is uncertain,
but in redundant base-10 it can be taken to be 3; if the number turns out to
be smaller, say 0.329998, express it as $0.33000\bar{2}$.

Chapter 4


Mathematics  for  Alternatives 


This chapter describes approximate rational arithmetic, mediant rounding,
standard and generalized continued fractions and Gosper’s algorithm. These
concepts are used in several of the alternative representation systems from
the literature that are described in Chapter 5, particularly the fixed-slash,
floating-slash and lexicographic continued fraction (LCF) representations by
Matula and Kornerup [MK80,KM81,MK83,KM83a,MK86,KM86,KM88].


4.1  Approximate  Rational  Arithmetic 

Our  interim  report  [ORA88]  only  discussed  rational  arithmetic  as  a  means  for 
doing  exact  computation.  It  only  considered  rationals  given  as  pairs  of  lists 
specifying  arbitrarily-large  integers,  or  as  scaled  versions  of  such  rationals 
multiplied  by  powers  of  a  fixed  base.  We  argued  that  the  inability  of  such 
representations  to  naturally  discard  information,  with  the  loss  of  speed  and 
excess  use  of  memory  that  this  implies,  makes  rational  arithmetic  impractical 
for  typical  applications. 

There  are  alternatives,  though,  of  approximate  rational  arithmetic.  These 
define  approximate  arithmetic  operations  on  finite  sets  of  rationals,  rationals 
that can each be stored in a fixed number of bits. In these arithmetics, the
results of an operation are the rationals that would be obtained by applying
a fixed rounding, as in Section 2.1, to the operation’s ideal results, rounding
them to representable values. These arithmetics avoid, to differing degrees,
the problem of a steadily-growing use of time and space that is implied by
exact  rational  arithmetic,  but  still  allow  many  rationals  to  be  represented 
exactly  and  many  operations  on  rationals  to  be  performed  exactly. 

With this very general definition, some versions of floating-point arithmetic,
particularly the VAX and IEEE ones, are approximate rational arithmetics;
floating-point values are rationals, and these arithmetics effectively
obtain their results by applying fixed rounding functions. All versions of
what are normally called approximate rational arithmetics, though, including
all those proposed by Matula and Kornerup, differ from floating-point
in exactly representing large numbers of simple rationals, e.g., 1/3. Further,
in all these arithmetics -x and 1/x are representable whenever x is; 1/0 is
representable and is taken as +∞.

In typical floating-point calculations, even input floating-point values are
thought of as approximations to ideal real numbers that are usually irrational.
We will call the numbers arising in a calculation rational if the ideal
inputs, exact intermediate results, and exact outputs of the calculation are
rational, and call these numbers irrational otherwise. The general hope behind
approximate rational arithmetics is that representable rationals will occur
frequently in calculations, so that the approximate arithmetics will often
capture the advantages of exact rational arithmetic while still maintaining
adequate precision in other cases.

All versions of approximate rational arithmetic provide means for discarding
information, particularly in the process of replacing exact results
by their representable approximations. The different approximate rational
arithmetics vary in which rationals are representable, which rounding
implicitly computes representable approximations, and which exceptions, error
indications and error values are returned in exceptional cases. Choosing
among such systems requires making trade-offs between the systems’ time
and space requirements, the range of the numbers they represent, and the
utility of any additional information they provide.

Call a fraction p/q simpler than a fraction p'/q' if p ≤ p', q ≤ q', and at
least one of these inequalities is strict. Let a simple chain be a finite set of
irreducible fractions, ordered by the usual order on the real numbers, with
the property that all irreducible fractions simpler than some member of the
chain are also in the chain. Every simple chain contains the rationals 0/1
and 1/0, which represent 0 and +∞ respectively; call a simple chain trivial
if it contains only these rationals. In each of the approximate rational
arithmetics considered in Chapter 5, 0 is representable, the negative of every
representable value is representable, and the nonnegative representable
rationals form a simple chain.

As an example, the set

{0/1, 1/3, 1/2, 2/3, 1/1, 3/2, 2/1, 3/1, 1/0}

is a simple chain, but would not be if 3/1 were removed; 3/1 is simpler than
3/2, and 3/2 is in the set. As the example indicates, the members of a
nontrivial simple chain are not evenly spaced.
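
The closure property that defines a simple chain can be checked mechanically. The following is a minimal sketch, assuming fractions are stored as (p, q) integer pairs with 1/0 standing for +∞; the function name `is_simple_chain` is illustrative and does not come from the literature cited here.

```python
from math import gcd

def is_simple_chain(fracs):
    """Check that a finite set of (p, q) pairs is closed under 'simpler than'."""
    members = set(fracs)
    # Every member must be irreducible (gcd(1, 0) == 1, so 1/0 qualifies).
    if any(gcd(p, q) != 1 for p, q in members):
        return False
    for p_max, q_max in members:
        # Every irreducible p/q with p <= p_max and q <= q_max must be present.
        for p in range(p_max + 1):
            for q in range(q_max + 1):
                if gcd(p, q) == 1 and (p, q) not in members:
                    return False
    return True

EXAMPLE = [(0, 1), (1, 3), (1, 2), (2, 3), (1, 1), (3, 2), (2, 1), (3, 1), (1, 0)]
```

Applied to the example set, the check succeeds; it fails once 3/1 is removed, because 3/2 is still present.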

The different arithmetics proposed by Matula and Kornerup listed above
all use a process, described in the next section, called mediant rounding to
round unrepresentable reals to representable ones. The approximate arithmetic
by Hwang and Chang [HC78], and the one by Yoshida [Yos83], do not;
they use roundings similar to the roundings for VAX arithmetic and for IEEE
arithmetic in its default, round-to-nearest mode.

Mediant  rounding  quite  often  does  not  round  unrepresentable  reals  to 
the  nearest  representable  ones.  This  rounding,  based  on  the  theory  of  best 
rational  approximations,  is  biased  in  favor  of  simplicity,  so  it  is  more  likely 
to  give  simple  rationals  than  complicated  ones  as  results. 

With  mediant  rounding,  errors  in  intermediate  results  often  cancel  out 
completely  in  computations  on  rational  numbers.  Errors  in  computations 
whose  ideal  inputs  are  irrational,  though,  tend  to  be  larger  with  mediant 
rounding  than  they  would  be  with  round-to-nearest  rounding.  Empirical 
results  to  these  effects  are  described  in  Subsection  5.1.4. 

The uneven spacing of members of simple chains, and the fact that mediant
rounding leads errors to cancel out, both have as a consequence that our
interim-report [ORA88] intention to view alternative representation systems
only as means for representing the endpoints of intervals was too restrictive.
None of the fixed-slash, floating-slash or LCF representations are appropriate
for representing the endpoints of intervals because the degree of conservatism
introduced by bounding reals with representable values varies greatly;
measures of this variability are given in Subsections 5.1.3 and 5.2.2. The
characteristic of mediant rounding that it tends to make errors cancel out
gives these representations a significant advantage over other representations,
though, an advantage contrasting with one of interval arithmetic’s greatest
weaknesses (cf. Section 2.3).

Still,  the  importance  of  mediant  rounding’s  ability  to  make  errors  cancel 
out  should  not  be  exaggerated.  Errors  tend  to  largely  cancel  out  even  without 
roundings  biased  in  favor  of  the  correct  results.  Note  in  Appendix  D  that, 
even  when  the  IEEE  interval  results  indicate  some  uncertainty,  three  of  the 
eight  scalar  FFT-multiply  results  for  IEEE  arithmetic,  and  one  of  the  eight 
for  VAX  arithmetic,  are  exactly  correct. 

4.2  Mediant  Rounding 

This section defines mediant rounding and gives some of its properties. The
following definitions and results are classic pieces of number theory [HW60];
proofs of the results are repeated in [MK80].

Let the letters p, q, and their primed variants all denote nonnegative
integers. Call two fractions p/q and p'/q' adjacent if |p/q - p'/q'| = 1/(qq'),
or equivalently if |pq' - p'q| = 1. Note that adjacent fractions must be
irreducible, and that except for the pair (0/1, 1/0) one of the two must be
simpler than the other.

Let the mediant of p/q and p'/q' be (p + p')/(q + q'). If p/q and p'/q' are
adjacent and p/q < p'/q', then the following are all true:

1. p/q < (p + p')/(q + q') < p'/q';

2. p/q and p'/q' are both simpler than (p + p')/(q + q'); and

3. (p + p')/(q + q') is the simplest rational strictly between p/q and p'/q'.

Further,  consecutive  members  of  any  simple  chain  are  adjacent.  (No,  this  is 
not  obvious,  and  neither  is  the  third  of  the  facts  just  listed.) 

From these facts it follows that if p/q and p'/q' are consecutive members
of any nontrivial simple chain and p/q < p'/q', then

1. p'/q' - p/q = 1/(qq');

2. (p + p')/(q + q') is not as simple as either of p/q and p'/q', and does
not belong to the simple chain; and

3. (p + p')/(q + q') is farther from the simpler of p/q and p'/q'.
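
These facts can be spot-checked on the finite members of the example chain from Section 4.1. The sketch below is only an illustration, not a proof: `simplest_between` is a hypothetical brute-force helper whose search bound of 30 is an arbitrary assumption, and 1/0 is left out so that every member can be converted to an exact `Fraction`.

```python
from fractions import Fraction

def adjacent(a, b):
    """Adjacency test |p q' - p' q| = 1 for fractions given as (p, q) pairs."""
    return abs(a[0] * b[1] - b[0] * a[1]) == 1

def mediant(a, b):
    return (a[0] + b[0], a[1] + b[1])

def simplest_between(lo, hi, bound=30):
    """Brute-force search for the fraction strictly between lo and hi whose
    numerator and denominator are both as small as possible."""
    found = [(p, q) for q in range(1, bound) for p in range(bound)
             if Fraction(*lo) < Fraction(p, q) < Fraction(*hi)]
    best = (min(p for p, _ in found), min(q for _, q in found))
    return best if best in found else None

def check_chain(chain):
    """chain: finite members of a simple chain as (p, q) pairs, increasing."""
    for lo, hi in zip(chain, chain[1:]):
        m = mediant(lo, hi)
        assert adjacent(lo, hi)                      # consecutive => adjacent
        assert Fraction(*lo) < Fraction(*m) < Fraction(*hi)
        assert Fraction(*hi) - Fraction(*lo) == Fraction(1, lo[1] * hi[1])
        assert simplest_between(lo, hi) == m         # mediant is the simplest
        assert m not in chain                        # and is not representable
    return True

FINITE_EXAMPLE = [(0, 1), (1, 3), (1, 2), (2, 3), (1, 1), (3, 2), (2, 1), (3, 1)]
```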

If R, a set of representable rationals, consists of the members of a simple
chain and their negations, define the mediant rounding function Φ_R from
arbitrary reals to members of R as follows [MK80]: Let x be a real number.
If x ∈ R, let Φ_R(x) = x. If x < 0, let Φ_R(x) = -Φ_R(-x). If x is positive and
not in R, there exist two consecutive fractions p/q and p'/q' in R such that
p/q < x < p'/q'. In this case, let m = (p + p')/(q + q') be the mediant of p/q
and p'/q', then let Φ_R(x) = p/q if x < m, let Φ_R(x) = p'/q' if x > m, and let
Φ_R(x) be the simpler of p/q and p'/q' if x = m.

Since the mediant (p + p')/(q + q') of adjacent fractions p/q and p'/q' is
farther from the simpler of the two, mediant rounding rounds most of the
reals in the interval from p/q to p'/q' to the simpler of the two. Further, the
interval of reals that are rounded to p/q is longer the simpler p/q is. These
phenomena give the precise meaning of the statement that mediant rounding
is “biased in favor of simplicity”.

Actually performing mediant rounding uses the theory of continued fractions,
which is also basic to the LCF representation. We will give an explicit
algorithm for computing Φ_R in the next section, a section that defines
standard continued fractions, defines concepts and notations associated with
them, and states some of their properties.
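
The definition of mediant rounding can be followed literally on a small chain. This is a minimal sketch, assuming the example chain of Section 4.1 as the nonnegative part of R, fractions stored as (p, q) pairs, and exact rational inputs; the names `mediant_round` and `simpler_of` are illustrative only.

```python
from fractions import Fraction

# Nonnegative members of R in increasing order; (1, 0) stands for +infinity.
CHAIN = [(0, 1), (1, 3), (1, 2), (2, 3), (1, 1), (3, 2), (2, 1), (3, 1), (1, 0)]

def simpler_of(a, b):
    """Of two adjacent fractions, return the one with the smaller p and q."""
    return a if a[0] <= b[0] and a[1] <= b[1] else b

def mediant_round(x, chain=CHAIN):
    """Round the rational x to a member of R, following the definition."""
    x = Fraction(x)
    if x < 0:
        p, q = mediant_round(-x, chain)
        return (-p, q)
    for lo, hi in zip(chain, chain[1:]):
        if x == Fraction(*lo):
            return lo                       # x is representable
        if hi[1] == 0 or x < Fraction(*hi):
            # lo < x < hi: compare x with the mediant of its neighbours.
            m = Fraction(lo[0] + hi[0], lo[1] + hi[1])
            if x < m:
                return lo
            if x > m:
                return hi
            return simpler_of(lo, hi)       # tie goes to the simpler neighbour
    raise ValueError("chain must end with (1, 0)")
```

In this chain every real from 2/5 to 3/5 rounds to 1/2, while only the reals from 1/4 to 2/5 round to the less simple 1/3, which illustrates the bias toward simplicity.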


4.3  Standard  Continued  Fractions 

For any real x, there exist unique reals n_0 and r_0 such that n_0 is an integer,
0 ≤ r_0 < 1, and

$$x = n_0 + r_0.$$

If r_0 ≠ 0, expressing r_0 as 1/(1/r_0) and repeating the process gives unique
reals n_1 and r_1 such that n_1 is an integer, 0 ≤ r_1 < 1, and

$$x = n_0 + \cfrac{1}{n_1 + r_1}.$$

Similarly, if r_1 ≠ 0 there exist unique reals n_2 and r_2 such that n_2 is an
integer, 0 ≤ r_2 < 1, and

$$x = n_0 + \cfrac{1}{n_1 + \cfrac{1}{n_2 + r_2}}.$$

The sequence of integers n_0, n_1, ... generated in this way is called the standard
continued fraction for x, and the successive integers are called the fraction’s
partial quotients. For the remainder of this section, assume that all continued
fractions are standard.


It is traditional to write a sequence of integers to be interpreted as the
partial quotients of a continued fraction in brackets, so for x and the n_i as
above,

x = [n_0, n_1, ...].

Any rational’s continued fraction is finite; i.e., for x and the n_i and r_i
as above, if x is rational the process eventually terminates for some i with
r_i = 0. As an example,

2.31 = [2, 3, 4, 2, 3].


Since for any integer n, n = (n - 1) + 1/1, relaxing the condition r_i < 1
to r_i ≤ 1 gives every rational two different continued fractions; e.g.,

2.31 = [2, 3, 4, 2, 3] = [2, 3, 4, 2, 2, 1].

The usual definition of continued fractions, with its r_i < 1 restriction,
eliminates the second of these two possibilities. In the LCF representation,
however, every finite continued fraction must have an even number of partial
quotients, so this representation eliminates the first of the two possibilities
for 2.31.
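
The construction above translates directly into a short routine for rational inputs. The following is a sketch using Python’s exact `Fraction` type; the function names are illustrative, and `alternate_form` simply rewrites a final partial quotient n as n - 1, 1 to produce the second expansion.

```python
from fractions import Fraction
from math import floor

def standard_cf(x):
    """Partial quotients [n_0, n_1, ...] of the rational x, with 0 <= r_i < 1."""
    x = Fraction(x)
    quotients = []
    while True:
        n = floor(x)            # n_i is the integer part
        quotients.append(n)
        r = x - n               # r_i is the fractional part
        if r == 0:
            return quotients
        x = 1 / r

def cf_value(quotients):
    """Evaluate a finite continued fraction exactly, from the last term back."""
    value = Fraction(quotients[-1])
    for n in reversed(quotients[:-1]):
        value = n + 1 / value
    return value

def alternate_form(quotients):
    """Second expansion of the same rational: [..., n] becomes [..., n - 1, 1]."""
    return quotients[:-1] + [quotients[-1] - 1, 1]
```

For 2.31 = 231/100 the two routines reproduce both expansions given above, and both evaluate back to 231/100.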

Many important irrational numbers have continued fractions that are
infinite but particularly simple. It is true, for example, that

√2 = [1, 2, 2, 2, 2, 2, ...]

and that

e = [2, 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, 1, 1, ...].

(Here e is the base of the natural logarithms.) Continued fractions do not
depend on the choice of a base, so continued fraction representations do not
depend on properties of the numbers 10, 2, 8 or 16 as decimal, binary, octal
or hexadecimal representations do.

For a real x = [n_0, n_1, ...], define the sequences of integers (p_i) and (q_i)
by

p_{-2} = 0,  q_{-2} = 1,
p_{-1} = 1,  q_{-1} = 0,
p_i = n_i · p_{i-1} + p_{i-2},  and
q_i = n_i · q_{i-1} + q_{i-2}

for all i ≥ 0. The rationals p_i/q_i are called the convergents of x. For all i ≥ 0,
the integers p_i and q_i and the convergents of x have the following properties,
all of which are classic number theory results [HW60] repeated in [MK80]:


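1. p_i/q_i = [n_0, n_1, ..., n_i];

2. gcd(p_i, q_i) = 1;

3. (Adjacency) q_i p_{i-1} - p_i q_{i-1} = (-1)^i;

4. (Alternating convergence)

$$\frac{p_0}{q_0} < \frac{p_2}{q_2} < \cdots < \frac{p_{2i}}{q_{2i}} < \cdots \le x \le \cdots < \frac{p_{2i-1}}{q_{2i-1}} < \cdots < \frac{p_3}{q_3} < \frac{p_1}{q_1};$$

5. (Best rational approximation) If r/s ≠ p_i/q_i and s ≤ q_i then

$$\left|\frac{r}{s} - x\right| > \left|\frac{p_i}{q_i} - x\right|;$$

6. (Quadratic convergence)

$$\frac{1}{q_i(q_{i+1} + q_i)} < \left|\frac{p_i}{q_i} - x\right| \le \frac{1}{q_i q_{i+1}}.$$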
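
The recurrence for p_i and q_i is easy to run by hand or by machine. As a small sketch (the function name `convergents` is illustrative), the convergents of 2.31 = [2, 3, 4, 2, 3] can be generated and the adjacency and alternating-convergence properties checked directly:

```python
from fractions import Fraction

def convergents(quotients):
    """Return the convergents (p_i, q_i) of [n_0, n_1, ...] as integer pairs."""
    p_prev2, q_prev2 = 0, 1     # p_{-2}, q_{-2}
    p_prev, q_prev = 1, 0       # p_{-1}, q_{-1}
    result = []
    for n in quotients:
        p = n * p_prev + p_prev2
        q = n * q_prev + q_prev2
        result.append((p, q))
        p_prev2, q_prev2, p_prev, q_prev = p_prev, q_prev, p, q
    return result

def adjacency_holds(pairs):
    """Check q_i p_{i-1} - p_i q_{i-1} = (-1)^i for i >= 1."""
    return all(q * p0 - p * q0 == (-1) ** i
               for i, ((p0, q0), (p, q)) in enumerate(zip(pairs, pairs[1:]), start=1))

CONV_231 = convergents([2, 3, 4, 2, 3])
```

The even-indexed convergents 2/1 and 30/13 lie below 2.31 and the odd-indexed ones 7/3 and 67/29 lie above it, with the last convergent equal to 231/100.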


We can now give an explicit algorithm for computing the mediant rounding
function Φ_R, an algorithm strongly related to Euclid’s greatest common
divisor algorithm [KM83a]. For an arbitrary real number x and a set of
rationals R consisting of the members of a simple chain and their negations,
compute the convergents p_0/q_0, p_1/q_1, ... of |x|. Then

$$\Phi_R(x) = \begin{cases} x & \text{if } x \in R, \\ -\Phi_R(-x) & \text{if } x < 0, \\ p_i/q_i & \text{if } x > 0,\ p_i/q_i \in R,\ p_{i+1}/q_{i+1} \notin R. \end{cases}$$


As a point of interest, while the last convergent of x that is in R gives the
mediant-round of x to a value in R, a method similar to the one for computing
successive convergents from partial quotients gives values between the last
convergent in R and the first one not in R, and it can be used to find the
member of R that is actually closest to x [Lov86].
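
The convergent-based rule can be sketched for one concrete choice of R. The code below assumes, purely for illustration, that the nonnegative members of R are the irreducible fractions p/q with 0 ≤ p, q ≤ N (for N = 3 this is exactly the example chain of Section 4.1); the names `in_R` and `round_by_convergents` are hypothetical. Starting the recurrence at p_{-1}/q_{-1} = 1/0 lets values too large to represent round to +∞.

```python
from fractions import Fraction
from math import floor, gcd

def in_R(p, q, N):
    """Assumed representable set: irreducible p/q with 0 <= p, q <= N."""
    return 0 <= p <= N and 0 <= q <= N and gcd(p, q) == 1

def round_by_convergents(x, N):
    """Return the last convergent of |x| that lies in R, with the sign of x."""
    x = Fraction(x)
    if x < 0:
        p, q = round_by_convergents(-x, N)
        return (-p, q)
    p_prev2, q_prev2 = 0, 1
    p_prev, q_prev = 1, 0               # p_{-1}/q_{-1} = 1/0, always in R
    while True:
        n = floor(x)
        p = n * p_prev + p_prev2
        q = n * q_prev + q_prev2
        if not in_R(p, q, N):
            return (p_prev, q_prev)     # previous convergent was the last in R
        r = x - n
        if r == 0:
            return (p, q)               # x itself is representable
        p_prev2, q_prev2, p_prev, q_prev = p_prev, q_prev, p, q
        x = 1 / r
```

With N = 3 this agrees with rounding by the definition on the example chain; for instance 2/5, which is the mediant of 1/3 and 1/2, rounds to the simpler neighbour 1/2.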


4.4  Generalized  Continued  Fractions 


A more general form of continued fractions gives a faster algorithm for doing
mediant rounding, a possible extension of the LCF representation, and a
variant of some of the constructive-real representations to be considered in
Chapter 6. These generalized continued fractions give redundant representations
of the reals that have the same sorts of advantages that redundant-digit-set
representations do. The basic idea of generalized continued fractions
is to stop considering only integers less than or equal to particular reals.

For any real x, let m_0 be any integer such that |x - m_0| < 1, and let s_0
be the real such that

$$x = m_0 + s_0.$$

If s_0 ≠ 0, express s_0 as 1/(1/s_0) and repeat the process, finding an integer
m_1 and a real s_1 such that |s_1| < 1 and

$$x = m_0 + \cfrac{1}{m_1 + s_1}.$$

Similarly, if s_1 ≠ 0, find an integer m_2 and a real s_2 such that |s_2| < 1 and

$$x = m_0 + \cfrac{1}{m_1 + \cfrac{1}{m_2 + s_2}}.$$

Call any sequence of integers m_0, m_1, ... generated by this process a
generalized continued fraction for x. As before, call the successive integers the
fraction’s partial quotients.

Unlike  standard  continued  fractions,  many  different  generalized  continued 
fractions  can  represent  the  same  real,  even  if  a  convention  is  adopted,  say, 
to  force  the  number  of  partial  quotients  in  a  generalized  continued  fraction 
to  be  even.  However,  convergents  can  be  defined  for  generalized  continued 
fractions  just  as  before,  and  they  have  many  of  the  same  properties. 

If generalized continued fractions are restricted so that their partial
quotients are “best possible” integer approximations, the restricted generalized
continued fraction for every real is unique. Using the notation just given, let
the optimal (generalized) continued fraction for x be the one such that for
all i, |s_i| ≤ 1/2 and s_i has the same sign as m_i if |s_i| = 1/2. The optimal
continued fraction for e, for example, is given by

e = [3, -4, 2, 5, -2, -7, 2, 9, -2, -11, 2, 13, -2, -15, ...].

A standard continued fraction that does not have 1 as one of its partial
quotients is optimal, though 1 can (rarely) occur as a partial quotient in a
continued fraction that is optimal. All except possibly the first of the partial
quotients of an optimal continued fraction have magnitude at least 2. One
can show that the sequence of convergents for a real’s optimal continued
fraction is a subsequence of the sequence of convergents for the real’s standard
continued fraction. (Cf. [KM83a].) The convergents of a generalized,
particularly optimal, continued fraction can thus converge to a real more quickly
than do the convergents of that real’s standard continued fraction.

As an example, the number 49/30 has the standard continued fraction

49/30 = [1, 1, 1, 1, 2, 1, 2],

with convergents 1/1, 2/1, 3/2, 5/3, 13/8, 18/11, and 49/30. The same
number has the optimal continued fraction

49/30 = [2, -3, 4, -3],

with convergents 2/1, 5/3, 18/11 and 49/30.

Matula and Kornerup give an algorithm in [KM83a] for performing mediant
rounding that computes generalized continued fractions and always
attempts to impose the restriction |s_i| ≤ 1/2. This algorithm succeeds in
imposing this restriction most of the time, so it determines “almost optimal”
generalized continued fractions. An “almost optimal” generalized continued
fraction’s sequence of convergents is not necessarily a subsequence of the
corresponding standard continued fraction’s convergents, but does determine
the last representable member of the standard continued fraction’s sequence
of convergents [KM83a], and hence can be used to do mediant rounding.
This algorithm is faster than the algorithm given earlier because “almost
optimal” continued fractions’ convergents typically converge faster than the
corresponding standard continued fractions’ convergents do.
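
To make the comparison between standard and optimal continued fractions concrete, the following sketch computes both for a rational input. It implements only the optimal (nearest-integer) expansion defined earlier in this section, not the “almost optimal” algorithm of [KM83a]; the function names are illustrative, and exact halves are sent toward zero so that the remainder has the same sign as the partial quotient.

```python
from fractions import Fraction
from math import floor

def nearest_integer(x):
    """Nearest integer to x; exact halves go toward zero."""
    lower = floor(x)
    frac = x - lower
    if frac != Fraction(1, 2):
        return lower if frac < Fraction(1, 2) else lower + 1
    return lower if x > 0 else lower + 1

def continued_fraction(x, pick):
    """Expand the rational x, choosing each partial quotient with pick()."""
    x = Fraction(x)
    quotients = []
    while True:
        m = pick(x)
        quotients.append(m)
        if x == m:
            return quotients
        x = 1 / (x - m)

def convergent_values(quotients):
    """Convergents as exact Fractions, using the same recurrence as before."""
    p_prev2, q_prev2, p_prev, q_prev = 0, 1, 1, 0
    values = []
    for m in quotients:
        p, q = m * p_prev + p_prev2, m * q_prev + q_prev2
        values.append(Fraction(p, q))   # Fraction normalises signs
        p_prev2, q_prev2, p_prev, q_prev = p_prev, q_prev, p, q
    return values

STANDARD = continued_fraction(Fraction(49, 30), floor)
OPTIMAL = continued_fraction(Fraction(49, 30), nearest_integer)
```

For 49/30 the optimal expansion reaches the exact value in four convergents where the standard expansion needs seven, and each optimal convergent also appears in the standard sequence.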

In the terminology introduced above, finding the optimal continued fraction
for an x being determined by successive approximations is only difficult
when one of the |s_i| ≈ 1/2. In practical algorithms where this problem arises,
like the faster mediant rounding algorithm or the generalizations of Gosper’s
algorithm to be discussed in Section 4.5 and in Chapter 6, the algorithm
makes a simple trade-off between the benefit of having optimal continued
fractions rather than nearly-optimal ones and the cost of distinguishing
optimal from nearly-optimal ones. Some of the algorithms in Chapter 6 would not
work without this flexibility, which arises from the redundancy in generalized
continued fractions. The redundancy in generalized continued fractions also
provides a possible means of extending the LCF representation discussed in
Section 5.2.


4.5  Gosper’s  Algorithm 

The LCF representation, some results in Chapter 6, and some of the research
questions in Chapter 7, all use or refer to Gosper’s algorithm [Gos72] for
finding the sum, difference, product or quotient of two (standard or generalized)
continued fractions. We will first describe the algorithm for standard
continued fractions, then indicate how to extend it to do arithmetic on LCF
encodings, and finally tell how to adapt it to generalized continued fractions.

First assume that all continued fractions are standard, and identify reals
and their continued fractions. Given continued fractions for the reals x and y,
Gosper’s algorithm computes the successive partial quotients in the continued
fraction for z, where

$$z(x, y) = \frac{axy + bx + cy + d}{exy + fx + gy + h}$$

for almost any integers a, b, c, d, e, f, g and h. The only restrictions on a
through h arise because of problems with possibly dividing by 0.

The algorithm treats a through h as assignable variables whose values
can be changed, and operates by inputting partial quotients from x and y,
outputting partial quotients to z, and making corresponding changes to the
variables a through h. The algorithm also treats x, y and z as assignable
variables whose values can change. To define notation, if

x = [n_0, n_1, n_2, ...], y = [m_0, m_1, m_2, ...] and z = [k_0, k_1, k_2, ...],

let

x' = [n_1, n_2, ...], y' = [m_1, m_2, ...], and z' = [k_1, k_2, ...].

Changing the sequence of remaining partial quotients for x from (n_0, n_1, ...)
to (n_1, ...), which the algorithm often does, is equivalent to replacing the
value of assignable variable x with the value x', and similarly with changing
the sequences of remaining partial quotients for y or z.

Gosper’s algorithm inputs the first partial quotient of one of its inputs,
say n_0 of x, and updates the integers a through h as follows:

$$\begin{aligned}
a &\to a \cdot n_0 + c, \\
b &\to b \cdot n_0 + d, \\
c &\to a, \\
d &\to b, \\
e &\to e \cdot n_0 + g, \\
f &\to f \cdot n_0 + h, \\
g &\to e, \text{ and} \\
h &\to f.
\end{aligned}$$

By the relationship x = n_0 + 1/x', the old value of z(x, y) is then the new value
of z(x', y). When it takes a partial quotient from the continued fraction of
x, though, the algorithm implicitly changes the (new) value of x to the (old)
value of x', so the old value of z(x, y) is equal to the new value of z(x, y). This
process is called ingesting a partial quotient from x. The process of ingesting
a partial quotient from y is similar; the necessary updates to a through h can
be calculated using the relationship y = m_0 + 1/y'.
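
The invariance claimed for ingestion can be checked numerically. In the sketch below the eight coefficients are held in a tuple (a, b, c, d, e, f, g, h); `ingest_x` transcribes the update rules given above, while `ingest_y` is this sketch’s own derivation from y = m_0 + 1/y' rather than a formula quoted from the literature, so it should be read as an assumption that the check then confirms.

```python
from fractions import Fraction

def z_value(cube, x, y):
    """Evaluate (axy + bx + cy + d) / (exy + fx + gy + h) exactly."""
    a, b, c, d, e, f, g, h = cube
    numerator = Fraction(a * x * y + b * x + c * y + d)
    return numerator / (e * x * y + f * x + g * y + h)

def ingest_x(cube, n0):
    """Substitute x = n0 + 1/x' and clear the factor 1/x'."""
    a, b, c, d, e, f, g, h = cube
    return (a * n0 + c, b * n0 + d, a, b, e * n0 + g, f * n0 + h, e, f)

def ingest_y(cube, m0):
    """Substitute y = m0 + 1/y' and clear the factor 1/y' (derived here)."""
    a, b, c, d, e, f, g, h = cube
    return (a * m0 + b, a, c * m0 + d, c, e * m0 + f, e, g * m0 + h, g)

ADD = (0, 1, 1, 0, 0, 0, 0, 1)   # z = x + y
```

For example, with x = 231/100 = 2 + 1/(100/31) and y = 49/30 = 1 + 1/(30/19), the value of z is unchanged when a partial quotient is ingested and the corresponding input is replaced by its tail.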

After one or more of its initial partial quotients have been ingested, the
value of x must be in the interval (1, ∞), and similarly for y. If the
denominator of the updated expression defining z cannot be 0, which it cannot
be if the initial values of a through h defined one of the four operations of
addition, subtraction, multiplication and division, then z(1, 1), z(1, ∞),
z(∞, 1) and z(∞, ∞) bound the possible values of z(x, y) [KM88]. The values
of z at these four extremes converge toward each other as the algorithm ingests
more partial quotients from x and y. In general, ingesting a partial quotient
of x reduces the uncertainty in z caused by uncertainty in x, and ingesting a
partial quotient of y reduces the uncertainty in z caused by uncertainty in y.

It might be possible to increase the parallelism in computations by choosing
the integers a through h to compute two or more operations at the same
time. In that case, having the denominator of z be 0 can be a problem
[KM88].

If a special symbol is used as an “end marker” for continued fractions, it
is possible to detect when all the partial quotients of an input have
been ingested; such an input is called exhausted. It is only necessary to check
two extremes to determine the possible range for z(x, y) if one of x or y is
exhausted, and the value of z is completely determined if both x and y are
exhausted.

After Gosper’s algorithm has ingested a large enough number of partial
quotients from x and y, the integer portion of z, the first partial quotient of
z’s continued fraction, is determined whatever the current values for x and
y are. If this partial quotient is k_0, the algorithm outputs k_0 and updates a
through h as follows:

$$\begin{aligned}
a &\to e, \\
b &\to f, \\
c &\to g, \\
d &\to h, \\
e &\to a - k_0 \cdot e, \\
f &\to b - k_0 \cdot f, \\
g &\to c - k_0 \cdot g, \text{ and} \\
h &\to d - k_0 \cdot h.
\end{aligned}$$

By the relationship z = k_0 + 1/z', the old value of z'(x, y) is then the new
value of z(x, y). When it outputs a partial quotient of the continued fraction
for z, though, the algorithm implicitly changes the (new) value of z to the
(old) value of z', so the continued fraction for the new value of z is the
remainder of the continued fraction for the old value of z. This process is
called outputting a partial quotient of z. If it is not yet possible to output
the next partial quotient of z, a good strategy for the algorithm is to ingest
a partial quotient from whichever of x and y seems to cause the most change
in z as z varies between its extremes.
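
Putting ingestion and output together gives a compact version of the algorithm for finite standard continued fractions. This is a sketch under several assumptions that are not taken from [Gos72] or [KM88]: `None` serves as the end marker, the first partial quotient of each input is ingested before any output is attempted, inputs are then ingested in simple alternation rather than by the “most change” strategy, and a partial quotient is output only when the four corner denominators share a strict sign and the four corner values share an integer part.

```python
from fractions import Fraction
from math import floor

def standard_cf(x):
    x, out = Fraction(x), []
    while True:
        n = floor(x)
        out.append(n)
        if x == n:
            return out
        x = 1 / (x - n)

def ingest_x(s, n):
    a, b, c, d, e, f, g, h = s
    if n is None:                       # end marker: the remaining x is infinite
        return (a, b, a, b, e, f, e, f)
    return (a * n + c, b * n + d, a, b, e * n + g, f * n + h, e, f)

def ingest_y(s, m):
    a, b, c, d, e, f, g, h = s
    if m is None:                       # end marker: the remaining y is infinite
        return (a, a, c, c, e, e, g, g)
    return (a * m + b, a, c * m + d, c, e * m + f, e, g * m + h, g)

def output(s, k):
    a, b, c, d, e, f, g, h = s
    return (e, f, g, h, a - k * e, b - k * f, c - k * g, d - k * h)

def gosper(xs, ys, cube):
    """Partial quotients of z(x, y) for finite standard inputs xs and ys."""
    xs, ys = list(xs) + [None], list(ys) + [None]
    s = ingest_y(ingest_x(tuple(cube), xs.pop(0)), ys.pop(0))
    out, take_x = [], True
    while True:
        a, b, c, d, e, f, g, h = s
        nums = (a, a + b, a + c, a + b + c + d)     # corners of the numerator
        dens = (e, e + f, e + g, e + f + g + h)     # corners of the denominator
        if not xs and not ys and e == 0:
            return out                              # z is infinite: finished
        same_sign = all(v > 0 for v in dens) or all(v < 0 for v in dens)
        if same_sign and len({n // v for n, v in zip(nums, dens)}) == 1:
            k = nums[0] // dens[0]
            out.append(k)
            s = output(s, k)
        elif xs and (take_x or not ys):
            s, take_x = ingest_x(s, xs.pop(0)), False
        else:
            s, take_x = ingest_y(s, ys.pop(0)), True

ADD = (0, 1, 1, 0, 0, 0, 0, 1)
SUB = (0, 1, -1, 0, 0, 0, 0, 1)
MUL = (1, 0, 0, 0, 0, 0, 0, 1)
DIV = (0, 1, 0, 0, 0, 0, 1, 0)
```

Once both inputs are exhausted the eight coefficients collapse to a single ratio, and the remaining output steps reduce to Euclid’s algorithm on that ratio.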

Following Matula and Kornerup [KM88], call the 8-tuple of integers a
through h the coefficient cube; each number corresponds to a corner of the
cube. The processes of ingesting partial quotients of x and y and of
outputting partial quotients of z thus cause changes in the coefficient cube, so
the coefficient cube reflects the current status of the computation of z.

Matula and Kornerup [KM88] give a method for evaluating z at its extremes
that reduces the amount of computation needed and also makes their
extension of Gosper’s algorithm to LCF encodings possible. They define an
8-tuple of integers A through H called a decision cube having the property
that the four extreme values of z are given by

$$z(1, 1) = \frac{D}{H}, \quad z(\infty, 1) = \frac{B}{F}, \quad z(1, \infty) = \frac{C}{G}, \quad z(\infty, \infty) = \frac{A}{E}.$$

The initial entries of the decision cube can be computed from the initial
entries of the coefficient cube by

$$\begin{aligned}
A &= a, & B &= a + b, & C &= a + c, & D &= a + b + c + d, \\
E &= e, & F &= e + f, & G &= e + g, & H &= e + f + g + h.
\end{aligned}$$




We  will  show  how  the  algorithm  updates  the  members  of  the  decision  cube  as 
it  ingests  and  outputs  partial  quotients  after  we  give  Matula  and  Kornerup’s 
critical  observation  that  the  processes  of  updating  the  coefficient  and  decision 
cubes  can  be  described  as  matrix  multiplications. 

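Matula and Kornerup [KM88] note that the transformation in the coefficient
cube produced by ingesting the partial quotient n_0 of x is equivalent
to multiplying each of the two matrices

$$\begin{pmatrix} d & b \\ c & a \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} h & f \\ g & e \end{pmatrix}$$

by the matrix

$$\begin{pmatrix} 0 & 1 \\ 1 & n_0 \end{pmatrix}.$$

Similarly, the transformation in the coefficient cube produced by ingesting
the partial quotient m_0 of y is equivalent to multiplying each of the matrices

$$\begin{pmatrix} d & c \\ b & a \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} h & g \\ f & e \end{pmatrix}$$

by the matrix

$$\begin{pmatrix} 0 & 1 \\ 1 & m_0 \end{pmatrix}.$$

Likewise, the transformation in the coefficient cube produced by outputting
the partial quotient k_0 of z is equivalent to multiplying each of the matrices

$$\begin{pmatrix} d & h \\ b & f \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} c & g \\ a & e \end{pmatrix}$$

by the matrix

$$\begin{pmatrix} 0 & 1 \\ 1 & -k_0 \end{pmatrix}.$$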


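The corresponding transformations on the decision cube are given for the
same ingesting of n_0 from x by multiplying the matrices

$$\begin{pmatrix} D & B \\ C & A \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} H & F \\ G & E \end{pmatrix}$$

by the matrix

$$\begin{pmatrix} 1 & 1 \\ n_0 & n_0 - 1 \end{pmatrix};$$

they are given for the same ingesting of m_0 from y by multiplying the matrices

$$\begin{pmatrix} D & C \\ B & A \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} H & G \\ F & E \end{pmatrix}$$

by the matrix

$$\begin{pmatrix} 1 & 1 \\ m_0 & m_0 - 1 \end{pmatrix};$$

and given for the same output of k_0 to z by multiplying the matrices

$$\begin{pmatrix} D & H \\ B & F \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} C & G \\ A & E \end{pmatrix}$$

by the matrix

$$\begin{pmatrix} 0 & 1 \\ 1 & -k_0 \end{pmatrix}.$$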


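The observation that the updates to the coefficient and decision cubes
can be made by multiplying by matrices is significant because each of the
matrices

$$\begin{pmatrix} 0 & 1 \\ 1 & n_0 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 \\ 1 & m_0 \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ n_0 & n_0 - 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ m_0 & m_0 - 1 \end{pmatrix}, \quad \text{and} \quad \begin{pmatrix} 0 & 1 \\ 1 & -k_0 \end{pmatrix}$$

can be factored, making it possible to ingest or output “pieces” of partial
quotients. This is critical to doing arithmetic in the LCF representation,
which is described in Subsection 5.2.3.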
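
The equivalence between the componentwise updates and the matrix products can be confirmed on arbitrary integers. The sketch below assumes that “multiplying by the matrix” means multiplication on the right, which is the reading under which the products reproduce the update rules of this section; the helper names are illustrative.

```python
def matmul(m, t):
    """Product m * t of 2x2 integer matrices given as ((row0), (row1))."""
    (p, q), (r, s) = m
    (t00, t01), (t10, t11) = t
    return ((p * t00 + q * t10, p * t01 + q * t11),
            (r * t00 + s * t10, r * t01 + s * t11))

def ingest_x(cube, n0):
    """Componentwise update of the coefficient cube for ingesting n0 from x."""
    a, b, c, d, e, f, g, h = cube
    return (a * n0 + c, b * n0 + d, a, b, e * n0 + g, f * n0 + h, e, f)

def ingest_x_by_matrix(cube, n0):
    """Same update, written as two right-multiplications by ((0,1),(1,n0))."""
    a, b, c, d, e, f, g, h = cube
    t = ((0, 1), (1, n0))
    (d2, b2), (c2, a2) = matmul(((d, b), (c, a)), t)
    (h2, f2), (g2, e2) = matmul(((h, f), (g, e)), t)
    return (a2, b2, c2, d2, e2, f2, g2, h2)

def decision_cube(cube):
    a, b, c, d, e, f, g, h = cube
    return (a, a + b, a + c, a + b + c + d, e, e + f, e + g, e + f + g + h)

def ingest_x_decision(dec, n0):
    """Update of the decision cube by ((1,1),(n0,n0-1)) for ingesting n0."""
    A, B, C, D, E, F, G, H = dec
    t = ((1, 1), (n0, n0 - 1))
    (D2, B2), (C2, A2) = matmul(((D, B), (C, A)), t)
    (H2, F2), (G2, E2) = matmul(((H, F), (G, E)), t)
    return (A2, B2, C2, D2, E2, F2, G2, H2)
```

Updating the decision cube directly gives the same result as updating the coefficient cube and recomputing the decision cube from it, so only one of the two needs to be carried through a computation.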

The most serious “catch” in Gosper’s algorithm, ignoring for the moment
the possibility of ingesting or outputting “pieces” of partial quotients, is that
the four possible extreme values of z might all be close to a particular integer
k, with some slightly above k and some slightly below it. In such a situation,
it is impossible to determine whether the next partial quotient of z is k or
k - 1, and this situation might persist indefinitely as the algorithm ingests
more and more partial quotients from x and y, even though the four possible
extreme values of z would get closer and closer to k. Such a situation is
exactly like not being able to determine the third digit of an ordinary base-10
number whose partial, approximate computation begins 0.32999... .

Gosper’s algorithm can be extended to generalized continued fractions.
For x, x', y, y', z and z' as defined before, but without the restriction that
all except the first of the partial quotients of x, y and z must be positive,
the relationships x = n_0 + 1/x', y = m_0 + 1/y' and z = k_0 + 1/z' are
still valid. Gosper’s algorithm can thus be applied even if x, y and z are all
generalized continued fractions. The only additional complications, described
in Subsection 6.4.2, are that there are more possibilities for having 0 as the
denominator of the expression defining z, and more extreme values to check
in bounding z. Even after ingesting one or more partial quotients from a
generalized continued fraction for x, for example, x might be any member of
the set (1, +∞) ∪ (-∞, -1). Even if the generalized continued fraction for x
is known to be optimal, after ingesting one or more partial quotients x can
still be any member of the set [2, +∞) ∪ (-∞, -2].

Generalized continued fractions avoid the main “catch” in Gosper’s algorithm.
If there is an integer k such that all the possible values of z are less
than distance 1 from k, then k can be taken as the next partial quotient of
z. If all the possible values of z are less than distance 1/2 from k, this choice
of k is optimal. It might sometimes happen that it is impossible to find the
optimal next partial quotient for z (in situations where the possible values
of z cluster around a point midway between two consecutive integers), but
it is always possible to find an acceptable next partial quotient for z. Deciding
how many partial quotients to ingest in such a situation is just a matter
of convenience, balancing the benefits of having optimal partial quotients
against the costs of determining them.
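
The selection rule for the generalized case can be sketched as a small function. It assumes, as a simplification not made in the text, that the possible values of z have already been bounded by a single finite interval [lo, hi] with rational endpoints; the name `acceptable_quotient` and the returned flag are illustrative.

```python
from fractions import Fraction
from math import floor

def acceptable_quotient(lo, hi):
    """Return (k, is_optimal) where every z in [lo, hi] is less than distance 1
    from the integer k, or None if no such integer exists yet."""
    lo, hi = Fraction(lo), Fraction(hi)
    # Any admissible k is the integer nearest the midpoint of the interval.
    k = floor((lo + hi) / 2 + Fraction(1, 2))
    if not (k - 1 < lo and hi < k + 1):
        return None                     # interval too wide: ingest more input
    # The choice is optimal when every possible z is within 1/2 of k.
    is_optimal = k - Fraction(1, 2) <= lo and hi <= k + Fraction(1, 2)
    return (k, is_optimal)
```

An interval such as [2.99, 3.01] stalls the standard algorithm, because the integer part could be 2 or 3, but here it yields the partial quotient 3 immediately.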

Being  able  to  ingest  and  output  “pieces”  of  partial  quotients  lessens  the 
impact  of  the  main  “catch”  in  Gosper’s  algorithm,  but  it  does  not  eliminate 
it.  We  discuss  this  problem  in  Subsection  5.2.4,  sketch  the  method  Matula 
and  Kornerup  propose  for  solving  it,  and  suggest  that  using  extensions  of 
Gosper’s  algorithm  to  generalized  continued  fractions  might  give  another 
way  of  solving  it. 

The second “catch” with Gosper’s algorithm is that the entries of the
coefficient cube, which determine the effects the next input partial quotients
have on the output, can grow arbitrarily large as more and more partial
quotients are ingested from x and y. This problem is discussed in Section
5.2, particularly Subsection 5.2.4, and in Chapter 6. It also inspired some of
the questions for future research in Chapter 7.
Chapter 5


Alternatives  in  the  Literature 


This chapter describes the following proposed representation systems for
computer arithmetic: the fixed-slash and floating-slash representations by
Matula and Kornerup [KM81,KM83a,MK85]; the binary-coded lexicographic
continued fraction (LCF) representation by Matula and Kornerup [MK83,
KM85,KM87,KM88]; the hybrid fixed-slash and floating-point representation
by Hwang and Chang [HC78]; the variable-length-exponent representation
by Iri and Matsui [MI81]; the repeating-mantissa floating-point
representation by Yoshida [Yos83]; the hyper-exponential representation by
Olver and Clenshaw [Olv87]; and the finite p-adic representation by Gregory
and Krishnamurthy [GK84]. Our descriptions of each alternative include
comments on its fitness as a scheme for representing the endpoints of
intervals, and whether it facilitates parallel computation.

We argue that the variable-length-exponent representation would be better
for command and control applications than standard floating-point
representations. This and the other representations form the basis for our new
representation proposals in Chapter 7.


5.1  Fixed-Slash  and  Floating-Slash 

This section defines the fixed-slash and floating-slash representation
systems proposed by Matula and Kornerup [KM81,KM83a,MK85]. Both
systems represent rational numbers as numerator-denominator pairs stored in
fields containing a fixed number of bits. A hypothetical “slash”, similar to
the slash that ordinarily indicates division and separates numerators from
denominators in fractions such as 56501575/103247889, separates the
numerator in each field from the denominator. In the fixed-slash representation
the slash is always fixed at the center of the field of bits. In the floating-slash
representation the slash’s position is determined by a separate slash-position
integer, and can be at any position in the field of bits and also at “positions”
outside the ends of the field.

5.1.1  Representations  of  Numbers 

More specifically, in a $(2k + 2)$-bit fixed-slash system, a value consists of a
sign bit, k numerator bits, an exact/inexact bit, and k denominator bits.
The numerator and denominator fields give nonnegative integers p and q
respectively, in binary. If $s \in \{0,1\}$ is the sign bit, the value represents the
rational $(-1)^s p/q$. The value is in normal form if $\gcd(p,q) = 1$.

The rationals representable in a $(2k + 2)$-bit fixed-slash system are exactly
the order-n Farey fractions $F_n$ with $n = 2^k - 1$, where $F_n$ is defined by

\[ F_n = \{\pm p/q : 0 \le p, q \le n,\ \gcd(p,q) = 1\}. \]

It is easy to see that, since $+1/0 = +\infty$ is a representable value, the
nonnegative members of $F_n$ form a simple chain for any $n \ge 1$, and every
member of $F_n$ is either a member or the negative of a member of this simple
chain.

Though  we  will  not  give  them  here,  Matula  and  Kornerup  define  special 
combinations with 0 in the numerator, denominator or both to represent $\pm 0$,
$\pm\infty$ and not-a-number (NaN) values similar to those in IEEE floating-point
arithmetic  [IEE85].  The  exact/inexact  bit  is  used  to  record  whether  the 
rational  stored  is  the  exact  result  of  the  operation  that  produced  it  or  an 
approximation  to  this  result. 

In a $(k + m + 1)$-bit floating-slash system, with $m > \log_2 k$, a value
consists of a sign bit, an exact/inexact bit, a $(k - 1)$-bit fraction field giving a
concatenated numerator and denominator-with-leading-bit-deleted pair, and
an m-bit signed binary integer giving the slash position. Let $s \in \{0,1\}$ be
the sign bit, let f be the value in the fraction field, and let e be the integer
stored in the m-bit slash position. If e is not the largest integer representable
in m bits, the number y represented by a $(k + m + 1)$-bit floating-slash value
is given by $y = (-1)^s \cdot p/q$, where the positive integers p and q are
determined from e and f as follows: If $0 \le e \le k - 2$, p is the binary integer
given by the first $k - 1 - e$ bits of f, and q is the binary integer given by 1
concatenated with the final e bits of f. If $e < 0$, p is the binary integer given
by 1 concatenated with all $k - 1$ bits of f and with a string of $|e|$ 0’s, and q
is 1. If $e \ge k - 1$, p is 1 and q is the binary integer given by 1 concatenated
with all $k - 1$ bits of f and with $e - (k - 1)$ 0’s. The value is in normal form
if $\gcd(p,q) = 1$.

If e is the largest integer representable in m bits, the value represented
is either $\pm\infty$ or NaN. Though we will not give them here, Matula and
Kornerup list specific interpretations for e and f in this case to distinguish
the separate possibilities. As in fixed-slash, the exact/inexact bit in
floating-slash distinguishes exact rationals from approximating ones.

To give simple examples, let $k = m = 5$ and ignore the sign and
exact/inexact bits. The field f then contains $5 - 1 = 4$ bits. If the slash
position is given by a signed binary integer, the possible values of e range
from -15 to 15. If $e = 15$, the value coded is $\pm\infty$ or NaN. Several
examples for $e < 15$ follow:


e =  1    f = 1001    p = 100_2 = 4          q = 11_2 = 3
e =  2    f = 0110    p = 01_2 = 1           q = 110_2 = 6
e =  0    f = 1001    p = 1001_2 = 9         q = 1
e = -2    f = 1010    p = 1101000_2 = 104    q = 1
e =  4    f = 1001    p = 1                  q = 11001_2 = 25
e =  6    f = 1010    p = 1                  q = 1101000_2 = 104
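
The decoding rules can be checked mechanically. The following Python sketch is an illustration only, not Matula and Kornerup’s encoding logic: the function name `decode_floating_slash` and the use of a bit string for f are assumptions made here, the sign and exact/inexact bits are ignored, and the reserved largest value of e (used for infinities and NaN) is not handled.

```python
def decode_floating_slash(e, f_bits):
    """Decode slash position e and fraction-field bits f into (p, q).

    f_bits is a string of k - 1 binary digits; the sign bit, the
    exact/inexact bit and the reserved largest e are not modeled.
    """
    n = len(f_bits)              # n = k - 1 fraction bits
    k = n + 1
    if 0 <= e <= k - 2:
        p = int(f_bits[:n - e], 2)          # first k - 1 - e bits
        q = int("1" + f_bits[n - e:], 2)    # implicit 1, then final e bits
        return p, q
    if e < 0:
        # Slash lies to the right of the field: an integer with |e| zeros.
        return int("1" + f_bits + "0" * (-e), 2), 1
    # e >= k - 1: slash lies to the left of the field, numerator is 1.
    return 1, int("1" + f_bits + "0" * (e - (k - 1)), 2)
```

With k = 5, calling `decode_floating_slash(1, "1001")` yields the pair (4, 3), matching the first example above.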

Actually,  we  have  described  only  one  of  the  possible  encodings  for  floating- 
slash  values  suggested  by  Matula  and  Kornerup.  They  note  in  [MK85]  that 
the  operations  on  floating-slash  values  might  be  speeded  up  by  putting  the 
denominator  bits  before  the  numerator  bits  and  putting  them  in  reverse 
order. 

Let the extended $(k + m + 1)$-bit floating-slash numbers be the rationals
representable by $(k + m + 1)$-bit floating-slash values with slash-position
values e such that $e < 0$ or $e > k - 2$. If $e < 0$, an extended
$(k + m + 1)$-bit floating-slash number is identical to a base-2 floating-point
number with an implicit leading 1 and $k - 1$ more bits in its mantissa. If
$e > k - 2$, an extended $(k + m + 1)$-bit floating-slash number is very similar
to such a base-2 floating-point number with a negative exponent. The
floating-slash representation thus has floating-point’s capacity to conveniently
represent numbers of widely varying magnitudes.

Let $FSL_k$, the standard floating-slash numbers with $(k - 1)$-bit fraction
fields, be the union of $\{\pm 0/1, \pm 1/0\}$ and the set of rationals representable
by $(k + m + 1)$-bit floating-slash values with slash positions e in the range
$0 \le e \le k - 2$. It is easy to check that the nonnegative members of $FSL_k$
form a simple chain.

It is also easy to check that $FSL_k$ is closely related to the set $H_n$ of
order-n hyperbolic fractions with $n = 2^{k-1} - 1$, defined by

\[ H_n = \{\pm p/q : pq \le n,\ \gcd(p,q) = 1\}. \]

Specifically,

\[ FSL_{k-1} \subset H_n \subset FSL_k. \]

The nonnegative members of $H_n$ also form a simple chain, and $FSL_k$ can be
approximated by $H_n$ with the loss of less than one bit of representation
capacity. Thus while the floating-slash values are dependent on the choice of
2 as a base, a set of standard floating-slash numbers is very similar to a set
of hyperbolic fractions, so is almost base-independent.

The loss of storage efficiency for both the fixed-slash and floating-slash
representations created by the presence of nonnormalized values is minimal.
Let $F_n^+$ be the positive order-n Farey fractions, and for a set X let $|X|$
denote the cardinality of X. By results of Dirichlet [MK85],

\[ \lim_{n \to \infty} |F_n^+| / n^2 = 6/\pi^2 \approx 0.6079 \]

and

\[ \lim_{n \to \infty} \frac{|H_n|}{|\{\pm p/q : pq \le n,\ p \ge 1,\ q \ge 1\}|} = 6/\pi^2 \approx 0.6079. \]

The loss from nonnormalized values is thus less than one bit.
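
As a rough numerical check of the Farey-fraction limit, one can count coprime numerator-denominator pairs directly. The sketch below is an illustration only; the choice n = 255 (corresponding to k = 8) and the helper name `normalized_fraction` are assumptions made here.

```python
from math import gcd, pi, log2

def normalized_fraction(n):
    """Fraction of pairs (p, q), 1 <= p, q <= n, that are in normal form."""
    coprime = sum(1 for p in range(1, n + 1)
                  for q in range(1, n + 1) if gcd(p, q) == 1)
    return coprime / (n * n)

density = normalized_fraction(255)      # n = 2**8 - 1
print(density, 6 / pi**2)               # observed density and its limit
print("bits lost:", -log2(density))     # storage cost of unnormalized codes
```

The quantity printed last is the base-2 logarithm of the ratio of all bit patterns to distinct values, which is the sense in which the loss is under one bit.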
5.1.2  Arithmetic  Operations


For fixed-slash and standard floating-slash numbers, arithmetic is performed
as described earlier: The result of any operation is computed as if the
operation were first performed exactly and then mediant rounding was used to
round the result to a representable value. Matula and Kornerup give
algorithms for this arithmetic in [KM83a]. It is actually not necessary to compute
the exact result of an operation first and then apply a mediant rounding
algorithm; one can get the effect of this by initializing $p_{-2}$, $q_{-2}$, $p_{-1}$
and $q_{-1}$ in the mediant rounding algorithm (see Section 4.3) to values
determined by the binary operation and one of its arguments, and then applying
the algorithm to compute “convergents” of the other argument. See [KM83a]
for details. This means of doing arithmetic is essentially a special case of the
algorithm by Gosper described in Section 4.5.
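
The “perform exactly, then round” model can be illustrated with exact rational arithmetic. The Python sketch below is not the algorithm of [KM83a]: it computes the exact sum first and then, on the assumption that mediant rounding into the Farey fractions selects the last continued-fraction convergent whose numerator and denominator both fit, walks the convergents of the result. The names `convergents`, `round_to_farey` and `fixed_slash_add` are hypothetical, and only nonnegative values are handled.

```python
from fractions import Fraction

def convergents(x):
    """Yield the continued-fraction convergents (p, q) of a Fraction x >= 0."""
    p_prev, q_prev, p, q = 0, 1, 1, 0     # (p_-2, q_-2) and (p_-1, q_-1)
    while True:
        a = x.numerator // x.denominator  # next partial quotient
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
        yield p, q
        frac = x - a
        if frac == 0:
            return
        x = 1 / frac

def round_to_farey(x, n):
    """Last convergent (p, q) of x with p <= n and q <= n; (1, 0) is +infinity."""
    best = (1, 0)                         # p_-1 / q_-1, used if x exceeds n
    for p, q in convergents(x):
        if p > n or q > n:
            break
        best = (p, q)
    return best

def fixed_slash_add(x, y, n):
    """Exact sum followed by rounding into the order-n Farey fractions."""
    return round_to_farey(x + y, n)
```

For example, with n = 15 the exact sum 1/3 + 1/7 = 10/21 is not representable and the sketch returns (1, 2): the value 10/21 lies above the mediant 8/17 of the adjacent Farey fractions 7/15 and 1/2.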

Matula  and  Kornerup  [KM83a]  give  algorithms  using  both  the  standard 
continued  fraction  and  the  generalized  continued  fraction  versions  of  mediant 
rounding  described  in  Sections  4.3  and  4.4.  Both  of  these  algorithms  can  be 
made  to  perform  arithmetical  operations  at  the  same  time  they  deduce  proper 
roundings  by  making  appropriate  initializations.  Matula  and  Kornerup  show 
how  these  algorithms  can  be  implemented  with  binary  shift  and  add/subtract 
operations,  and  note  which  parts  of  these  algorithms  can  be  performed  in 
parallel.  These  observations  are  special  cases  of,  and  inspired,  their  ideas 
on  how  arithmetic  could  be  performed  in  LCF  that  will  be  described  in 
Subsection  5.2.3. 

These  algorithms  produce  correct,  but  possibly  nonnormalized,  results  if 
their  arguments  are  not  normalized.  It  is  thus  not  necessary  to  normalize 
arguments  before  performing  arithmetic  operations. 

Matula and Kornerup also suggest how the add/subtract operations might
be speeded up by using carry-save or borrow-save arithmetic, techniques
that use redundancy in the representation of integers. The algorithms for
performing arithmetic and doing rounding require being able to normalize
integers and determine whether they are zero, operations that are generally
difficult for redundantly-coded integers. Matula and Kornerup give
techniques for reducing the zero-determination problem to the normalization one,
and for doing the necessary integer normalizations for both carry-save and
borrow-save representations by looking only at the first two digits [KM83a]. If
redundantly-coded integers are used, integer results must be converted back
into standard binary to produce the final fixed- or floating-slash results.

Although arithmetic units for fixed-slash and standard floating-slash
representations have not yet been implemented in hardware, Matula and
Kornerup give a theoretical analysis of the numbers of operations and
sub-operations that must be performed in carrying out their algorithms. This
analysis suggests that the times needed to do fixed-slash and standard
floating-slash arithmetic are comparable to the times needed to do noniterative
divide operations on binary integers with similar numbers of bits [KM83a].
They also found results consistent with these estimates in simulation
experiments.

Matula and Kornerup do not give explicit algorithms for doing arithmetic
on pairs of extended floating-slash values or pairs containing a standard and
an extended floating-slash value. Presumably, computed results could always
be produced as if the operations were first performed exactly and nonzero
exact results were rounded as follows:

•  If the magnitude of the exact result is between that of the smallest and
largest positive standard floating-slash numbers, use mediant rounding
to round to a standard floating-slash number.

•  Otherwise, round the exact result to the nearest extended
floating-slash number if this number is determined uniquely, and round it to
the nearest extended floating-slash number whose final digit is 0 if there
are two equally-near such numbers.

We  do  not  know,  however,  how  making  such  an  extension  to  the  arithmetic 
unit  proposed  by  Matula  and  Kornerup  would  affect  the  unit’s  speed  and 
complexity. 


5.1.3  Errors  in  Mediant  Rounding 

The gap sizes between consecutive representable numbers for both
fixed-slash and standard floating-slash representations vary greatly, so the amount
of error introduced by mediant rounding can also vary greatly. We will follow
Matula and Kornerup in approximating the fixed-slash and standard
floating-slash representations by Farey and hyperbolic fractions, respectively. By the
results on simple chains given earlier, if $p/q$ and $p'/q'$ are consecutive
Farey or hyperbolic fractions then

\[ p'/q' - p/q = 1/(qq'), \quad \text{so} \quad \frac{p'/q' - p/q}{p/q} = \frac{1}{pq'}. \]

For the order-n Farey fractions, the gap sizes for consecutive fractions in the
interval [0,1] thus vary from about $1/n^2$ to about $1/n$, and for the order-n
hyperbolic fractions the relative gap sizes for consecutive finite fractions vary
from about $1/n$ to about $1/\sqrt{n}$.
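
The gap formula and the range of gap sizes for Farey fractions are easy to confirm by enumeration. The following Python sketch does so for a small order; the value n = 31 and the helper name `farey_unit_interval` are illustrative assumptions, not part of the cited work.

```python
from fractions import Fraction
from math import gcd

def farey_unit_interval(n):
    """Sorted order-n Farey fractions lying in [0, 1]."""
    return sorted({Fraction(p, q) for q in range(1, n + 1)
                   for p in range(0, q + 1) if gcd(p, q) == 1})

n = 31
chain = farey_unit_interval(n)
gaps = []
for left, right in zip(chain, chain[1:]):
    gap = right - left
    # Consecutive members p/q < p'/q' of a simple chain differ by 1/(q q').
    assert gap == Fraction(1, left.denominator * right.denominator)
    gaps.append(gap)

# Smallest gap, largest gap, and mean gap over [0, 1].
print(min(gaps), max(gaps), sum(gaps) / len(gaps))
```

The smallest gap is 1/(n(n-1)), between 1/n and 1/(n-1), and the largest is 1/n, next to 0/1 and 1/1, which is the spread from about $1/n^2$ to about $1/n$ described above.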

The absolute error introduced by mediant rounding on the interval [0,1] in
a $(2k + 2)$-bit fixed-slash system can thus be as large as about $2^{-k}$, and the
relative error introduced by mediant rounding in a $(k + m + 1)$-bit
floating-slash system can be as large as about $2^{-k/2}$. These error values are far
worse than the corresponding absolute error of about $2^{-2k}$ for 2k-bit binary
fixed-point numbers and the corresponding relative error of about $2^{-k}$ for
k-bit binary floating-point numbers. However, as we will now show, typical
absolute and relative error values introduced by mediant rounding are much
smaller and are comparable to the errors in ordinary fixed-point and
floating-point arithmetic. Thus, even though mediant rounding tends to produce
errors that are larger than those for round-to-nearest rounding in computations
whose ideal results are not rational, the loss can be expected to be small.

Typical errors introduced by mediant rounding are much smaller than
maximum ones because typical gap sizes between consecutive Farey or
hyperbolic fractions are much closer to the lower extremes than the upper ones.
Large gap sizes only occur around comparatively rare simple fractions.
Further, around these simple fractions the large gap sizes give mediant rounding
its desirable bias towards simplicity.

If R is a set of rationals consisting of the members of a simple chain and
their negatives, and $\Phi_R$ is the mediant rounding function rounding reals to
members of R, Matula and Kornerup show the following [MK80]:

•  If $R = F_n$ and X is a uniformly-distributed random variable on [0,1],

\[ \mathrm{Exp}(|X - \Phi_R(X)|) = [\text{illegible in source}] + o(1) \]

and for $1 < a < 2$,

\[ \mathrm{Prob}\{|X - \Phi_R(X)| > n^{-a}\} < 2n^{a-2}. \]

•  If $R = H_n$ and X is a log-uniformly-distributed random variable on
$[1/n, n]$,

[formula illegible in source]

and for $1/2 < a < 1$ and n sufficiently large,

[formula illegible in source]

Absolute and relative error values are thus closely distributed around
expected values close to the error values for corresponding binary fixed-point
or floating-point numbers with the same number of bits of precision.
Empirical results on average and “better than all except one in a million or
one in a trillion” gap sizes and relative gap sizes, for implementations of
fixed-slash and floating-slash representations, are consistent with these
expectations [MK85].

5.1.4  Empirical  Results 

Matula and Ferguson [FM85] give empirical information strongly related to
mediant rounding and standard floating-slash arithmetic. They produce this
information with a floating-point simulator that computes an approximation
to the mediant rounding function for R a set of hyperbolic fractions. They
use this simulator to perform two sets of complicated calculations, one for
ideal numbers which are all rationals and the other for ideal numbers which
are all irrationals. They compare the results for these calculations, using the
simulator to perform the arithmetic operations, with the results for these
calculations produced using ordinary floating-point arithmetic.

Although they are intended to test floating-slash arithmetic, the results
actually compare two different ways of using floating-point hardware. The
results are stated not with respect to ideal numbers, but with respect to
the best-possible approximations to these ideal numbers for the version of
floating-point arithmetic being used. They show, among other things, that
doing a floating-point simulation of mediant rounding can produce
significantly more accurate answers for calculations involving quantities whose
ideal values are simple rationals.

The simulator represents a fraction p/q as the floating-point result of
dividing the integer p by the integer q. It simulates approximate rational
arithmetic by using a floating-point approximation to the rational arithmetic
algorithm described in Section 4.3. To perform addition, say, it starts with
two of its floating-point representations for rationals, computes the
floating-point sum v of these representations, uses floating-point arithmetic to
compute (approximations to) the partial quotients and convergents of v, finds the
last convergent $p_i/q_i$ such that $p_i q_i < m$ for an integer limit m, and returns
the floating-point result of dividing $p_i$ by $q_i$ as the result of the addition.
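
The simulator’s addition step can be sketched as follows. This is a reconstruction from the prose description only, not the code of [FM85]: it uses Python’s IEEE double-precision floats rather than CDC 6600 arithmetic, handles only nonnegative finite values, and the names `simulated_rational_round`, `simulated_add` and `max_terms` are hypothetical.

```python
import math

def simulated_rational_round(v, m, max_terms=64):
    """Return the float p/q for the last convergent of v with p * q < m.

    Partial quotients are computed in floating point, so later ones may be
    wrong; v >= 0 and finite is assumed.
    """
    p_prev, q_prev, p, q = 0, 1, 1, 0
    best = v            # returned unchanged if even the first convergent fails
    x = v
    for _ in range(max_terms):
        a = math.floor(x)                     # approximate partial quotient
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
        if p * q >= m:
            break
        best = p / q
        frac = x - a
        if frac == 0.0:
            break
        x = 1.0 / frac                        # floating-point inversion
    return best

def simulated_add(u, v, m=2**48):
    """Floating-point sum followed by the simulated rational rounding."""
    return simulated_rational_round(u + v, m)
```

With these definitions the floating-point sum 0.1 + 0.2, which is not equal to 0.3 in double precision, is pulled back to the float nearest 3/10, illustrating the bias towards simple rationals.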

Note that since partial quotients are integers, many of the partial
quotients and convergents the simulator computes are exactly correct. Note
further, though, that the intermediate result v is only an approximation to
the ideal result of performing the desired operation on the pair of rationals
represented by the floating-point values combined to produce v, and the
errors in floating-point inversion will also eventually cause the computed partial
quotients for v to be incorrect.

The simulator produced the empirical results in [FM85] by running on
a CDC 6600. It used single-precision arithmetic in some tests, and
double-precision arithmetic in others. Since the floating-point values on the CDC
6600 have 48-bit mantissas in single-precision and 96-bit mantissas in
double-precision, the simulation took $m = 2^{48}$ for the single-precision calculations
and $m = 2^{96}$ for the double-precision ones. In all cases it thus computed
convergents until the product of their numerator and denominator was no
longer exactly representable in the floating-point arithmetic currently being
used.

The computations involving only numbers whose ideal values are rational
were finding the inverses of Hilbert matrices. The order-n Hilbert matrix $H_n$
is the n by n matrix whose entry in position $i, j$, $1 \le i, j \le n$, is
$1/(i + j - 1)$. For example,

\[
H_3 = \begin{bmatrix}
1/1 & 1/2 & 1/3 \\
1/2 & 1/3 & 1/4 \\
1/3 & 1/4 & 1/5
\end{bmatrix}
\]

Hilbert  matrices  arise  in  finding  the  best  least-squares  approximation,  over 
the  interval  [0,1],  of  a  continuous  function  by  a  polynomial  of  a  given  degree 
[FM85]. 

The problem of inverting Hilbert matrices is often used as a test of
arithmetic systems (cf. the comments on matrix inversion in [KL85]) because
it can be solved exactly, using formulas included in [FM85], and is very
ill-conditioned. The entries of $H_n^{-1}$ are integers for all n, and the entry with
the largest magnitude increases very rapidly as n increases; the largest entry
in $H_5^{-1}$, for example, is 179200. For any matrix A, let max(A) be the
magnitude of the element of A for which this magnitude is maximum. As Matula
and Ferguson explain in [FM85], if A is an n by n matrix, the condition number
$n \cdot \max(A) \cdot \max(A^{-1})$ estimates how greatly an initial relative error in
an entry in A is magnified into a final relative error in an entry in a computed
solution to AX = B.
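
The exact inverse and the condition-number estimate can be reproduced with integer arithmetic. The sketch below uses a standard binomial-coefficient closed form for the entries of the inverse Hilbert matrix; whether this is the same formula given in [FM85] is not confirmed here, and the function names are illustrative.

```python
from math import comb

def hilbert_inverse(n):
    """Exact integer inverse of the order-n Hilbert matrix (1-based i, j)."""
    return [[(-1) ** (i + j) * (i + j - 1)
             * comb(n + i - 1, n - j) * comb(n + j - 1, n - i)
             * comb(i + j - 2, i - 1) ** 2
             for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def condition_estimate(n):
    """n * max(H_n) * max(H_n^-1), using the fact that max(H_n) = 1."""
    largest = max(abs(x) for row in hilbert_inverse(n) for x in row)
    return n * 1 * largest
```

For n = 5 the largest entry of the inverse computed this way is 179200, in agreement with the figure quoted above.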

The base-10 logarithm of A’s condition number thus estimates how many
base-10 digits are likely to be lost in the computed final entries of $A^{-1}$; lost in
addition to those digits already lost by approximating these entries and the
entries of A by floating-point numbers of a particular precision. (These error
estimates could probably have been stated more clearly in terms of mantissa
bits lost in the computed results.) Using estimates in the literature, Matula
and Ferguson estimate that about 1.63n decimal digits of accuracy can be
expected to be lost in computing the inverse of $H_n$. Since the largest entry
in $H_n$ is 1, this estimate is equivalent to estimating the greatest magnitude
of an entry in $H_n^{-1}$ as $10^{1.63n}/n$.

The algorithm Matula and Ferguson use for computing the inverse of
an n by n matrix A, a method recommended as reducing computational
error, treats the identity $AA^{-1} = I$ as a collection of n systems of linear
equations. It solves each system as follows: It uses Gaussian elimination
with no pivoting to compute a factorization A = LU of A, where L and U
are lower- and upper-triangular matrices respectively. It then solves LY = B
by forward elimination, and solves UX = Y by backward substitution.
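
The inversion procedure just described can be sketched generically, so that the same code runs on exact rationals or on floats. This is an illustrative Python version written from the description above, not the program of [FM85]; the function names and the `number` parameter are assumptions.

```python
from fractions import Fraction

def lu_no_pivot(a):
    """Factor a = L U by Gaussian elimination without pivoting."""
    n = len(a)
    u = [row[:] for row in a]
    l = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]          # elimination multiplier
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u

def solve_lu(l, u, b):
    """Solve L y = b forwards, then U x = y backwards."""
    n = len(b)
    y = [0] * n
    for i in range(n):
        y[i] = b[i] - sum(l[i][j] * y[j] for j in range(i))
    x = [0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))) / u[i][i]
    return x

def invert(a):
    """Invert a by solving a x = e_j for each column e_j of the identity."""
    n = len(a)
    l, u = lu_no_pivot(a)
    cols = [solve_lu(l, u, [1 if i == j else 0 for i in range(n)])
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def hilbert(n, number=Fraction):
    """Order-n Hilbert matrix with entries of the given numeric type."""
    return [[number(1) / number(i + j - 1) for j in range(1, n + 1)]
            for i in range(1, n + 1)]
```

Calling `invert(hilbert(n))` gives the exact inverse, while `invert(hilbert(n, float))` gives a double-precision result whose relative errors against the exact one indicate the digits lost for that n.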
In their results, Matula and Ferguson give estimated and observed
numbers of decimal digits of accuracy lost in computing the inverses of Hilbert
matrices of different orders. The estimated digit-loss numbers are those given
by the condition numbers, and the observed digit-loss numbers are calculated
from the relative errors, scaled to fit the precision of the floating-point
arithmetic being used, in the entries of the computed matrices. They give results
for calculations using four different types of arithmetic:

1.  Single-precision  floating-point; 

2.  Double-precision  floating-point; 

3.  Simulated approximate rational arithmetic, single-precision
floating-point and $m = 2^{48}$; and

4.  Simulated approximate rational arithmetic, double-precision
floating-point and $m = 2^{96}$.

The results, given in Figure 4 in [FM85], are striking. The observed
numbers of digits lost using straightforward floating-point arithmetic, both for
single- and double-precision, closely match the losses predicted by the
condition numbers: in both cases the observed loss numbers vary from about 1
to 4 fewer digits lost, but these differences are for estimated numbers of digits
lost as large as 29. The numbers of digits lost using simulated approximate
rational arithmetic are much lower. The single-precision simulation inverts
each of $H_1$ through $H_9$ with the loss of less than a single digit. The
double-precision simulation inverts each of $H_1$ through $H_{18}$ with the loss of less than
a single digit; for $H_{18}$, by contrast, straightforward double-precision
arithmetic loses about 25 of the roughly 29 digits available in double-precision
floating-point.

Note that by the estimate used for the condition numbers, the magnitude
of the largest element in $H_{18}^{-1}$ is on the order of $2^{93.3}$ and the magnitude
of the largest element in $H_9^{-1}$ is on the order of $2^{45.6}$. Both simulations
thus only begin to lose accuracy in inverting the Hilbert matrices when the
inverses begin to contain elements that are not exactly representable in the
floating-point arithmetic being used.

The simulations’ observed numbers of digits lost increase sharply beyond
these limits, going to about 7 for single-precision on $H_{10}$ and to about 10 for
double-precision on $H_{20}$, but the single- and double-precision simulations
succeed in inverting all the Hilbert matrices through $H_{15}$ and $H_{29}$, respectively,
before their numbers of digits lost exhaust all the digits available for their
respective forms of floating-point arithmetic. The straightforward single- and
double-precision floating-point calculations only retain some of the available
digits for matrices no larger than $H_{13}$ and $H_{20}$, respectively.

Matula and Ferguson also give estimated and observed numbers of
decimal digits lost in computing the inverses of the matrices $H'_n = D H_n D$ for
D a diagonal matrix whose ith entry is the ith root of a randomly-chosen,
108-binary-digit number in the interval (0,1). They give results for the worst
cases with 25 different choices of the initial random number. The condition
number of an $H'_n$ is the same as that of $H_n$, but its ideal entries are irrational.
The estimated number of digits lost for inverting an $H'_n$ is the same as the
number for inverting $H_n$. Matula and Ferguson give observed digit-loss
results for calculations using straightforward double-precision arithmetic and
simulated approximate rational arithmetic, computed with double-precision
floating-point and $m = 2^{96}$. In the simulated rational arithmetic calculations,
the initial entries of the $H'_n$ are rounded to rationals p/q such that $pq < 2^{96}$.

The results, given in Figure 5 in [FM85], show that the actual
numbers of digits lost for both floating-point and simulated approximate rational
arithmetic calculations are virtually identical, and both are virtually
identical to the estimated numbers of digits lost given by the condition numbers.
The floating-point results are typically better by a fraction of a decimal
digit. These results, together with the results on inverting Hilbert matrices,
strongly support the assertion that for values stored in comparable numbers
of bits, floating-slash arithmetic would produce much better results than
floating-point on calculations whose ideal values are rational, and would
produce comparable results on calculations whose ideal values are irrational.

5.1.5  Potential  Applications 

The theoretical and empirical results given above both indicate that
floating-slash arithmetic, particularly floating-slash arithmetic that includes
extended floating-slash values, would perform as well as floating-point for
typical applications and perform significantly better than floating-point for
applications doing calculations on rational numbers. The chief disadvantage
of floating-slash is that it is slower, though its other disadvantages (not being
appropriate for interval arithmetic, and presumably producing computational
errors that are harder to analyze and bound) could also be significant. As
a practical matter, hardware doing floating-point arithmetic is also widely
available, while hardware doing floating-slash arithmetic is just now being
developed.

Matula and Kornerup [MK85] list these potential applications for systems
doing approximate rational arithmetic:

•  Symbolic computation programs that mix exact rational and
approximate real arithmetic;

•  Combinatorial optimization problems, as in linear programming with
sparse 0-1 constraint matrices; and

•  Knowledge-based systems applications in which it is critical to
recognize simple rationals.

We were unable to find applications in command and control software
where rational numbers arose to a significant extent. In particular, they did
not arise in the fragment of the Hostile Booster Interception code [App87]
that the Reals project examined, and we learned of no such applications
through our inquiries with our Air Force contract monitors and with experts
in Operations Research.

Floating-slash arithmetic thus seems to be a number representation
system with advantages over floating-point, but advantages that have not yet
been obvious enough to inspire its widespread use. Significantly, Knuth
[Knu81, page 316] lists the following as an exercise of “significant open
problem level” difficulty:

Modify one of the compilers at your installation so that it will
replace all floating-point calculations by floating-slash calculations.
Experiment with the use of slash arithmetic by running existing
programs that were written by programmers who actually had
floating-point arithmetic in mind. ... Are the results better or
worse when floating-slash numbers are substituted?

Note further that the simple simulator program used by Matula and
Ferguson [FM85], and described above, captures much of the advantage of
mediant rounding, with mediant rounding’s bias in favor of simple results,
for computations on rational numbers, and does it with existing
floating-point hardware. A hardware implementation of floating-slash
arithmetic might produce still more accurate results, since the simulator
computes only an approximation to floating-slash arithmetic, but the main
advantage of such hardware would be its greater speed. For computations
involving rational numbers for which speed is not a factor, the simulator
will presumably suffice.

Neither  the  fixed-slash  nor  floating-slash  representation  is  appropriate  as 
a  means  for  representing  the  endpoints  of  intervals,  and  neither  facilitates 
parallel  computation,  though  parts  of  individual  operations  can  be  done  in 
parallel. 


5.2  The  LCF  Representation  System 

This section defines the Lexicographically-Coded Continued Fraction (LCF)
representation by Matula and Kornerup [MK83,KM85,KM87,KM88]. This
representation gives a binary encoding of finite initial segments of real
numbers’ standard continued fractions; for the remainder of this section,
we will assume all continued fractions are standard unless we specifically
note otherwise. If the number of bits available for each binary encoding
is fixed, the LCF representation becomes a form of approximate rational
arithmetic. The encoding has the convenient property that the
lexicographic order on real numbers’ binary encodings is equivalent to
numerical order on the real numbers, hence the representation’s name.

With the LCF representation, arithmetic can be performed in an on-line
fashion. (On-line arithmetic was described in Section 3.4.) The LCF
representation thus facilitates parallel computation, and combines the
advantages of on-line arithmetic with those of approximate rational
arithmetic. Further, the LCF representation supports performing
computations to a specified precision. Representation systems giving
on-demand accuracy, and carrying out calculations to precisions determined
by the numbers being calculated, will be considered in Chapter 6.

This section defines the LCF representation, discusses its efficiency as a
means of storing real numbers, and describes how arithmetic can be
performed on numbers given in their LCF forms. It ends with a brief
discussion of how redundantly-coded binary numbers or generalized
continued fractions might be used to avoid a throughput problem that can
arise in LCF arithmetic.


5.2.1  The  LCF  Encoding 

We will present the LCF encoding by first giving an encoding for positive
integers, then an encoding for the continued fractions of nonnegative
rationals, then an encoding for arbitrary continued fractions, and finally
an encoding for arbitrary real numbers. All encodings are in binary, and
are given by Matula and Kornerup in [MK83].

Every positive integer $m$ has a base-2 representation

\[ m = (1\,b_{n-1} b_{n-2} \cdots b_0)_2
     = 2^n + b_{n-1} 2^{n-1} + \cdots + b_0. \]

The bit string $1^n 0\, b_{n-1} b_{n-2} \cdots b_0$, where $1^n$ denotes
$n$ consecutive 1’s, then gives a self-delimiting binary encoding of $m$;
the code for 1 can be taken to be 0, and $+\infty$ can be taken to be an
infinite string of 1’s. Since these codes are self-delimiting, it is
possible to concatenate the codes for a sequence of positive integers into
a single bit-string, then unambiguously deduce the sequence of integers
from this string. Note that the lexicographic order on these codes matches
the numerical order on the corresponding integers.
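The integer code just described can be sketched in a few lines of Python. This is an illustrative sketch, not code from [MK83]; the function names `encode_int` and `decode_ints` are hypothetical, and the decoder assumes its input is a well-formed concatenation of codes.

```python
def encode_int(m: int) -> str:
    """Self-delimiting code 1^n 0 b_{n-1}...b_0 for a positive integer m."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    n = m.bit_length() - 1            # m = 2^n + (value of the low n bits)
    low_bits = format(m, "b")[1:]     # b_{n-1} ... b_0 (empty when n == 0)
    return "1" * n + "0" + low_bits


def decode_ints(bits: str) -> list[int]:
    """Split a concatenation of codes back into the integers it encodes."""
    out, pos = [], 0
    while pos < len(bits):
        n = 0
        while bits[pos] == "1":       # the unary prefix gives n
            n += 1
            pos += 1
        pos += 1                      # skip the 0 that ends the prefix
        low = bits[pos:pos + n]
        pos += n
        out.append((1 << n) | (int(low, 2) if low else 0))
    return out


if __name__ == "__main__":
    print([encode_int(m) for m in (1, 2, 3, 4)])
    print(decode_ints("0" + "100" + "11000" + "101"))
```

Comparing the code strings as ordinary Python strings gives the same ordering as comparing the integers, which is the order-preservation property noted in the text.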

Every nonnegative rational has a continued fraction $[q_0, \ldots, q_k]$
such that $q_0 \geq 0$ and $q_i > 0$ for all $i \neq 0$. This continued
fraction can be thought of as $[q_0, \ldots, q_k, +\infty]$, so $+\infty$
can be used as an end-marker. As we noted in Section 4.3, the continued
fraction can be chosen so that $k$ is even. If $i$ is even, let $s_i$ be
the self-delimiting binary code for $q_i$, and otherwise let it be the
1’s complement of the self-delimiting binary code for $q_i$.

Code the nonnegative rational with continued fraction
$[q_0, \ldots, q_k]$ as follows: If $q_0 = 0$ (i.e., if the rational is
less than 1) take the first bit of the code to be 0 and take the rest of
the code to be the concatenation, for all $1 \leq i \leq k$, of the $s_i$.
If $q_0 > 0$, take the first bit of the code to be 1 and take the rest of
the code to be the concatenation, for all $0 \leq i \leq k$, of the $s_i$.

Note that the code for 0 is 0, and that the code for every rational ends
with an infinite string of 0’s. Further note that since increasing an
even-indexed partial quotient of a (standard) continued fraction makes the
corresponding real larger, while increasing an odd-indexed partial
quotient makes that real smaller, the lexicographic order on these codes
matches the numerical order on the corresponding rationals.

For an arbitrary rational $r$, if $r \geq 0$ code $r$ as 1 followed by the
coding just given of $|r|$, and if $r < 0$ code $r$ as 0 followed by the
2’s complement of the coding just given of $|r|$. (Taking 2’s complements
is necessary to have the codes always end with infinite strings of 0’s.)
Lexicographic order on the codes still matches numerical order on the
corresponding rationals.

Here are several examples, taken from [MK83], of this coding for arbitrary
rationals. They illustrate why various aspects of the coding are necessary.
Every code ends with an infinite string of 0’s.


    Value   Code        Value   Code
    8       11111       -1/8    01111
    4       11110       -1/4    01110
    3       11101       -1/3    01101
    2       11100       -1/2    01100
    5/3     11011       -3/5    01011
    3/2     11010       -2/3    01010
    1       11000       -1      01000
    2/3     10110       -3/2    00110
    3/5     10101       -5/3    00101
    1/2     10100       -2      00100
    1/3     10011       -3      00011
    1/4     10010       -4      00010
    1/8     10001       -8      00001
    0       10000       ±∞      00000
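The rules above for rationals can be checked against these examples with a short Python sketch. It is an illustration only, not the authors’ or [MK83]’s implementation: the function names are hypothetical, the special code for ±∞ is not handled, and the 2’s complement is taken over the finite part of the code before the trailing 0’s.

```python
from fractions import Fraction


def encode_int(m: int) -> str:
    """Self-delimiting code 1^n 0 b_{n-1}...b_0 of a positive integer."""
    n = m.bit_length() - 1
    return "1" * n + "0" + format(m, "b")[1:]


def even_cf(r: Fraction) -> list[int]:
    """Continued fraction [q_0, ..., q_k] of r >= 0, with k made even."""
    q, num, den = [], r.numerator, r.denominator
    while den:
        a, rem = divmod(num, den)
        q.append(a)
        num, den = den, rem
    if len(q) % 2 == 0:               # k odd: [..., q_k] -> [..., q_k - 1, 1]
        q[-1:] = [q[-1] - 1, 1]
    return q


def flip(bits: str) -> str:
    """1's complement of a bit string."""
    return bits.translate(str.maketrans("01", "10"))


def lcf_code(r: Fraction, width: int) -> str:
    """First `width` bits of the code for the rational r."""
    q = even_cf(abs(r))
    parts = [encode_int(qi) if i % 2 == 0 else flip(encode_int(qi))
             for i, qi in enumerate(q) if i > 0 or qi > 0]
    code = ("1" if q[0] > 0 else "0") + "".join(parts)
    if r >= 0:
        full = "1" + code
    else:                             # 2's complement of the finite part
        twos = (1 << len(code)) - int(code, 2)
        full = "0" + format(twos, "0%db" % len(code))
    return full.ljust(width, "0")[:width]


if __name__ == "__main__":
    for text in ("8", "5/3", "1/2", "0", "-1/8", "-3/2"):
        print(text, lcf_code(Fraction(text), 5))
```

Sorting the example values numerically and sorting their codes as strings gives the same order, which is the property the encoding is named for.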

Finally, for an arbitrary real, code the real as the limit of the codes for
the initial segments, rewritten if necessary to make their final indexes
even, of that real’s continued fraction. Lexicographic order on the codes
matches numerical order on the corresponding reals.

5.2.2  LCF Gap Sizes and Range


We  will  present  theoretical  and  empirical  results  from  [KM85]  on  the  sizes 
of  the  gaps  between  consecutive  values  representable  in  an  LCF  system  for 
which  there  is  a  fixed  bound  on  the  number  of  bits  in  an  LCF  code.  By 
abuse  of  terminology,  we  will  use  “the  LCF  encoding”  to  refer,  depending 
on  the  context,  to  either  the  encoding  of  positive  integers,  the  encoding  of 
nonnegative  rationals,  or  the  encoding  of  arbitrary  rationals  just  described. 
Note  that  the  bit-length  of  a  positive  integer’s  LCF  code  increases  by  1  if 
the  integer  is  interpreted  as  a  nonnegative  rational,  and  the  bit-length  of  a 
nonnegative  rational’s  LCF  code  increases  by  1  if  the  nonnegative  rational  is 
interpreted  as  an  arbitrary  rational. 

If a rational $p/q$ is chosen randomly and uniformly over $[0, 1]$, for
sufficiently large $i$ the probability that the $i$th partial quotient of
this rational’s continued fraction is $j$ is [Knu81]

\[ \log_2 \frac{(j+1)^2}{j(j+2)}. \]

By information-theoretic arguments, an optimal coding for this frequency
distribution would use

\[ -\log_2 \log_2 \frac{(j+1)^2}{j(j+2)} \]

bits to code $j$. With the LCF encoding, the expected number of bits per
partial quotient for rationals chosen uniformly over $[0, 1]$ is then 3.52,
while for the optimal coding the expected number of bits per partial
quotient would be 3.45 [KM85].

In particular, the LCF code for 1 is 0, which requires only 1 bit, while an
optimal code would use 1.27 bits. The LCF codes for 2 and 4 are 100 and
11000 respectively, requiring 3 and 5 bits, while an optimal code would use
only 2.56 and 4.09 bits for these numbers. For LCF codes with only $k$
bits, in the interval $[0, 1]$ the smallest gap sizes between consecutive
LCF values can thus be expected near the rational with continued fraction
$[0, 1, 1, 1, \ldots, 1]$ and LCF code 01010101...01, while the largest gap
sizes can be expected near the rational with continued fraction
$[0, 2, 4, \ldots, 2, 4]$ and LCF code 00111100001111...00001111; in the
first case the gap sizes behave asymptotically as $2^{-1.388k}$, and in
the second case they behave asymptotically as $2^{-0.8268k}$ [KM85].
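The per-quotient bit counts quoted above can be reproduced with a short Python sketch. The function names are hypothetical. The two gap exponents at the end rest on a heuristic that is our assumption rather than a statement from [KM85]: that the gap near a rational with denominator $Q$ is roughly $1/Q^2$, with denominators growing by the golden ratio per 1-bit quotient in $[0, 1, 1, \ldots]$ and by $5 + \sqrt{24}$ per 8-bit pair of quotients in $[0, 2, 4, \ldots]$.

```python
import math


def lcf_bits(j: int) -> int:
    """Length of the self-delimiting code 1^n 0 b_{n-1}...b_0 for j."""
    return 2 * (j.bit_length() - 1) + 1


def quotient_probability(j: int) -> float:
    """Limiting probability that a partial quotient equals j."""
    return math.log2((j + 1) ** 2 / (j * (j + 2)))


def optimal_bits(j: int) -> float:
    """Information-theoretic code length for a partial quotient j."""
    return -math.log2(quotient_probability(j))


# Assumed heuristic: gap ~ 1/Q^2, so the exponent per code bit is
# 2 * log2(denominator growth) / (bits spent to obtain that growth).
phi = (1 + math.sqrt(5)) / 2
small_gap_exponent = 2 * math.log2(phi) / 1                  # [0,1,1,...]
large_gap_exponent = 2 * math.log2(5 + math.sqrt(24)) / 8    # [0,2,4,...]

if __name__ == "__main__":
    for j in (1, 2, 4):
        print(j, lcf_bits(j), round(optimal_bits(j), 2))
    print(round(small_gap_exponent, 3), round(large_gap_exponent, 4))
```

Under these assumptions the sketch gives 1.27, 2.56 and 4.09 optimal bits for quotients 1, 2 and 4, and gap exponents of about 1.388 and 0.8268.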

Matula and Kornerup give exhaustive or Monte Carlo analyses of gap sizes
on $[0, 1]$ for $k$-bit LCF codes, and the maximum and minimum gap sizes
are consistent with these expectations. In all cases, gap sizes vary
between these two extremes and keep this range of variation even as $k$
increases.

However, as $k$ increases the gap sizes become log-normally distributed
around $2^{-k}$, exactly the gap size for $2^k$ uniformly-distributed
points in $[0, 1]$. For $k = 128$, 99.9% of the gap sizes fall in a band
around this value that is far narrower than the range between the
extremes. Thus even though the extreme gap sizes persist as $k$ increases
in being exponentially larger or smaller than the gap size for
uniformly-distributed points, typical gap sizes become closer to those for
uniformly-distributed points [KM85]. The $k$-bit LCF codes are not
asymptotically uniformly dense on $[0, 1]$ as the $k$-bit fixed-slash
values are; this shows the dependence of LCF codes on the base 2.

LCF codes also do not have the ability to represent numbers of greatly
varying magnitudes in fixed numbers of bits, one of the great advantages
of the floating-point and extended floating-slash representations. Matula
and Kornerup suggest using a separate, independent representation of the
large-magnitude partial quotient $q$ in continued fractions of the forms
$[q, \ldots]$, $[0, q, \ldots]$ and $[-1, 1, q, \ldots]$ to extend the
range of representable values [MK83]. Another possibility, similar to one
of the means for representing rationals given in our interim report (see
Section 4.1), is to represent reals as fractions in the interval $(-1, 1)$
multiplied by powers of 2; the fractions could be given by LCF codes and
the powers by an integer exponent. We will note how arithmetic operations
might be performed on numbers represented in this way in the next
subsection.


5.2.3  LCF  Arithmetic 

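As we described earlier, in Gosper’s algorithm the necessary updates to the
coefficient and decision cubes can be made by multiplying by matrices of
the forms

\[ \begin{bmatrix} 0 & 1 \\ 1 & p \end{bmatrix}, \quad
   \begin{bmatrix} 1 & 1 \\ p & p-1 \end{bmatrix}, \quad \text{and} \quad
   \begin{bmatrix} 0 & 1 \\ 1 & -p \end{bmatrix} \]

for nonnegative integers $p$. If $p > 0$, by the base-2 representation of
$p$, $p = 2^n + b_{n-1} 2^{n-1} + \cdots + b_0$ for some collection of
$b_i$, with each $b_i \in \{0, 1\}$.

These matrices can be factored as follows:

\[ \begin{bmatrix} 0 & 1 \\ 1 & p \end{bmatrix} =
   \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}^{n}
   \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}
   \begin{bmatrix} 1/2 & b_{n-1}/2 \\ 0 & 1 \end{bmatrix} \cdots
   \begin{bmatrix} 1/2 & b_0/2 \\ 0 & 1 \end{bmatrix}, \]

\[ \begin{bmatrix} 1 & 1 \\ p & p-1 \end{bmatrix} =
   \begin{bmatrix} 1 & 0 \\ 1 & 2 \end{bmatrix}^{n}
   \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}
   \begin{bmatrix} \frac{1+b_{n-1}}{2} & \frac{b_{n-1}}{2} \\
                   \frac{1-b_{n-1}}{2} & \frac{2-b_{n-1}}{2} \end{bmatrix}
   \cdots
   \begin{bmatrix} \frac{1+b_0}{2} & \frac{b_0}{2} \\
                   \frac{1-b_0}{2} & \frac{2-b_0}{2} \end{bmatrix}, \]

and

\[ \begin{bmatrix} 0 & 1 \\ 1 & -p \end{bmatrix} =
   \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}^{n}
   \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix}
   \begin{bmatrix} 1/2 & -b_{n-1}/2 \\ 0 & 1 \end{bmatrix} \cdots
   \begin{bmatrix} 1/2 & -b_0/2 \\ 0 & 1 \end{bmatrix}. \]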

Note that there is a one-to-one correspondence between the factors in
these factorizations and the bits in the LCF code for $p$. In Gosper’s
algorithm, it is thus possible to ingest individual bits of the LCF codes
for the partial quotients of $x$ and $y$, and output individual bits of the
LCF codes for the partial quotients of $z$. It is only necessary to
maintain status flags recording things such as whether the partial quotient
being ingested has an even or odd index to keep track of whether each bit
ingested should be interpreted as itself or its negation and whether each
bit output should be output as itself or its negation. It is thus possible
to do approximate rational arithmetic at the bit level, and produce more
accurate results if they are demanded just by outputting more bits.
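The bit-by-bit factorizations can be checked numerically. The following Python sketch multiplies out one factor per bit of the code $1^n 0\, b_{n-1} \cdots b_0$ and compares the product with the target matrix for each of the three forms. It is a verification aid under our reading of the factorizations, with hypothetical function names, and is not code from [KM88].

```python
from fractions import Fraction as F
from functools import reduce


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def split_bits(p):
    """Return n and the bits b_{n-1}, ..., b_0 of p = 2^n + ..."""
    n = p.bit_length() - 1
    return n, [(p >> i) & 1 for i in range(n - 1, -1, -1)]


def factors(p, form):
    """One factor per bit of the code 1^n 0 b_{n-1}...b_0."""
    n, bits = split_bits(p)
    if form == "ingest":      # target [[0, 1], [1, p]]
        lead, mid = [[1, 0], [0, 2]], [[0, 1], [1, 1]]
        tail = [[[F(1, 2), F(b, 2)], [0, 1]] for b in bits]
    elif form == "mixed":     # target [[1, 1], [p, p - 1]]
        lead, mid = [[1, 0], [1, 2]], [[1, 1], [1, 0]]
        tail = [[[F(1 + b, 2), F(b, 2)], [F(1 - b, 2), F(2 - b, 2)]]
                for b in bits]
    else:                     # "output": target [[0, 1], [1, -p]]
        lead, mid = [[1, 0], [0, 2]], [[0, 1], [1, -1]]
        tail = [[[F(1, 2), F(-b, 2)], [0, 1]] for b in bits]
    return [lead] * n + [mid] + tail


def target(p, form):
    return {"ingest": [[0, 1], [1, p]],
            "mixed": [[1, 1], [p, p - 1]],
            "output": [[0, 1], [1, -p]]}[form]


if __name__ == "__main__":
    for form in ("ingest", "mixed", "output"):
        ok = all(reduce(matmul, factors(p, form)) == target(p, form)
                 for p in range(1, 200))
        print(form, ok)
```

Each list returned by `factors` has exactly $2n + 1$ entries, matching the $2n + 1$ bits of the code for $p$.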

Every input bit corresponds to a shift or shift-and-add-or-subtract
operation. If $p = 0$, the necessary updates can also be made with a single
such operation. Further, four of the eight necessary coefficient updates
can be performed in parallel [KM88].

In producing output, it is not necessary to ingest inputs until the next
partial quotient of $z$ is determined, but only until the next bit of the
LCF encoding of the next partial quotient of $z$ is determined. This
generally makes it possible to produce output much earlier. (This method of
improving throughput is more natural than the idea suggested by Jones
[Jon84] mentioned in our interim report; Jones suggests giving partial
quotients as sequences of nested intervals and using these intervals to get
tighter bounds on the extreme values of $z$.)

A general description of the extension of Gosper’s algorithm to LCF codes
follows. A more detailed description is in [KM88]. For simplicity, our
description assumes that both $x$ and $y$ are positive and says
“output 0”, say, instead of “output whichever of 0 or 1 would be
interpreted as the bit 0 in the LCF code for the partial quotient of $z$
currently being produced”.

1.  Initialize the counter $c$ to 0.

2.  Input bits from $x$ and $y$ until it is possible to determine that
$z \geq 2$ or that the first partial quotient of $z$ is 1, i.e., that

\[ 2 \leq A/E,\ B/F,\ C/G,\ D/H \]

or that

\[ 1 \leq A/E,\ B/F,\ C/G,\ D/H < 2. \]

In the first case, output 1, perform the

\[ \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \]

transformation for output to $z$, increment the counter $c$, and repeat
step 2. (This has the effect of dividing $z$ by 2 and continuing.) In the
second case, output 0, perform the

\[ \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix} \]

transformation for output to $z$, and go to step 3.

3.  While $c > 0$, decrement $c$, then input bits from $x$ and $y$ until it
is possible to determine that $z \geq 2$ or that the first partial quotient
of $z$ is 1. In the first case, output 0 and perform the

\[ \begin{bmatrix} 1/2 & 0 \\ 0 & 1 \end{bmatrix} \]

transformation for output to $z$; in the second case, output 1 and perform
the

\[ \begin{bmatrix} 1/2 & -1/2 \\ 0 & 1 \end{bmatrix} \]

transformation for output to $z$. After $c = 0$, go back to step 2 to
produce the next partial quotient of $z$.
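The control flow of steps 1 through 3 can be illustrated on an exactly known value of $z$. The Python sketch below is a simplification and not the algorithm of [KM88]: it replaces the tests on the coefficient-cube ratios with tests on $z$ itself, it ignores the complementing of bits for odd-indexed quotients, and it reads the four output transformations as the maps $z \mapsto z/2$, $z \mapsto 1/(z-1)$, $w \mapsto w/2$ and $w \mapsto w/(2-w)$, which is our interpretation of the matrices. The function name is hypothetical.

```python
from fractions import Fraction


def emit_quotient_bits(z: Fraction):
    """Emit the code bits of the leading partial quotient of an exact z >= 1.

    Returns (bits, tail), where tail is the value whose partial quotients
    come next, or None when the continued fraction has ended.
    """
    bits, c = [], 0                   # step 1: counter starts at 0
    while z >= 2:                     # step 2, first case: output 1, halve z
        bits.append(1)
        z /= 2
        c += 1
    bits.append(0)                    # step 2, second case: 1 <= z < 2
    if z == 1:                        # nothing remains; the low bits are 0
        return bits + [0] * c, None
    w = 1 / (z - 1)                   # assumed reading of [[0, 1], [1, -1]]
    while c > 0:                      # step 3: one low-order bit per pass
        c -= 1
        if w >= 2:
            bits.append(0)
            w /= 2                    # assumed reading of [[1/2, 0], [0, 1]]
        else:
            bits.append(1)
            w = w / (2 - w)           # assumed reading of [[1/2, -1/2], [0, 1]]
    return bits, w


if __name__ == "__main__":
    print(emit_quotient_bits(Fraction(13, 3)))
    print(emit_quotient_bits(Fraction(7, 2)))
```

For $z = 13/3 = [4, 3]$ the sketch emits 11000, the code for 4, and leaves the tail 3; for $z = 7/2 = [3, 2]$ it emits 101, the code for 3, and leaves the tail 2.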




Matula and Kornerup give estimates of the average numbers of partial
quotients in the continued fractions for rationals in the set $H_n$ of
hyperbolic fractions. They also give estimates of the numbers of shift and
add/subtract operations that must be performed per partial quotient for
arithmetic operations that take members of $H_n$ as inputs and produce
members of $H_n$ as outputs [KM87]. These estimates are based on the
assumption that bits are output at roughly the same rate they are ingested,
an assumption that is not always correct and is discussed more fully in the
next subsection.

Since Gosper’s algorithm can compute operations other than addition,
subtraction, multiplication and division by making appropriate
initializations of the coefficient cube, it could be used to perform
arithmetic operations on reals given as rationals in the interval
$(-1, 1)$ times a power of 2. For example, for reals given as
$x \cdot 2^i$ and $y \cdot 2^j$, where $x$ and $y$ are given by LCF codes,
Gosper’s algorithm could be used to compute their sum by initializing the
coefficient cube to $a = 0$, $b = 2^i$, $c = 2^j$, $d = 0$, $e = 0$,
$f = 0$, $g = 0$ and $h = 1$. Questions related to this approach are noted
in Chapter 7.
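A small Python sketch shows why this initialization yields the sum. It assumes the usual form of Gosper’s two-variable function, $z = (axy + bx + cy + d)/(exy + fx + gy + h)$, with the coefficients in the order listed above; the function names are hypothetical.

```python
from fractions import Fraction


def bihomographic(cube, x, y):
    """Evaluate z = (a*x*y + b*x + c*y + d) / (e*x*y + f*x + g*y + h)."""
    a, b, c, d, e, f, g, h = cube
    return (a * x * y + b * x + c * y + d) / (e * x * y + f * x + g * y + h)


def scaled_sum_cube(i, j):
    """Coefficients that make z equal to x * 2**i + y * 2**j."""
    return (0, Fraction(2) ** i, Fraction(2) ** j, 0, 0, 0, 0, 1)


if __name__ == "__main__":
    x, y = Fraction(1, 3), Fraction(1, 2)
    print(bihomographic(scaled_sum_cube(3, -1), x, y))   # 8/3 + 1/4
```

With this cube the denominator is identically 1, so the exponents $i$ and $j$ enter only through the two numerator coefficients.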

5.2.4  Possible  Uses  of  Redundancy 

Although the extension of Gosper’s algorithm to LCF codes is usually able
to output the next bit of $z$ after ingesting only a few bits of $x$ and
$y$ (in simulations, Matula and Kornerup found typical delays of about 5
bits [KM87]), there are situations where potentially infinite numbers of
bits must be ingested from $x$ and $y$ to determine the next output bit for
$z$. Matula and Kornerup [KM88] give an example where the result is either
slightly greater than 2, with LCF form 11000...01..., or slightly less than
2, with LCF form 10111...10.... This example is equivalent to not being
able to decide whether a number’s continued fraction is $[2, k]$ or
$[1, 1, m]$ for large integers $k$ and $m$.

This is exactly the sort of problem that arises with nonredundant
representations of numbers. Matula and Kornerup are currently investigating
the possibility [KM88] of using the redundant “bit” set
$\{0, 1, \bar{1}\}$, where $\bar{1} = -1$, to avoid this problem. In the
example, the arithmetic unit could output 1 and “correct” it if necessary
later by outputting $\bar{1}$. They hope in this way to guarantee uniform
throughput.


As we noted in Section 4.5, the redundancy in generalized continued
fractions allows these continued fractions to avoid the “unlimited wait for
the next partial quotient” problem in Gosper’s algorithm. They could also
be used to guarantee uniform throughput. Related results and questions are
given in Chapters 6 and 7.

The second major problem with Gosper’s algorithm, that of how big the
entries of the coefficient cube become as the inputs are ingested, also
arises for LCF arithmetic. In their simulations, Matula and Kornerup noted
that these elements seem to grow about as quickly as bits are ingested from
the inputs, so registers of order $k$ bits for storing the coefficients
should work for performing operations on $k$-bit LCF codes. They list
accurately determining the necessary size of these registers as a problem
for future research [KM88].

It is possible that using a redundant bit set will not only increase
throughput, but also decrease the necessary sizes of the coefficient-cube
registers. Our results on the growth of the coefficient-cube entries for
versions of Gosper’s algorithm producing standard and generalized continued
fractions are described in Subsection 6.4.2; they are essentially contrary
to our expectation that increased throughput would reduce the growth of
these entries. The problem of trying to limit the sizes of the necessary
registers by using other possible encodings also inspired a question for
future research in Chapter 7.


5.3  Hybrid Floating-Point and Fixed-Slash

Hwang and Chang [HC78] propose an extension of binary floating-point that
allows them to represent many rational numbers exactly. As in ordinary
floating-point, they represent each value as a mantissa times a power of
2, and as in ordinary floating-point a mantissa can be a radix fraction, or
normalized base-2 number in the interval $[1/2, 1)$. Unlike ordinary
floating-point, however, the mantissa can also be the ratio of two integers
$p$ and $q$ such that $1/2 \leq p/q < 1$. If there are $2k$ bits in the
field determining the mantissa, Hwang and Chang represent the fraction
$p/q$ as two $k$-bit integers $q - p$ and $q$. Since it is always true that
$q - p \leq q/2$, with this representation the first bit of the field
determining the mantissa is always 0 if the mantissa is a rational and is
always 1 if it is a $2k$-bit radix fraction.

Let $R_{2k}$ be the $2k$-bit radix fractions, and let

\[ U_{2k} = F_{2k} \cup R_{2k} \]

be the union of the $2k$-bit radix fractions with the Farey fractions
$F_{2k}$. Hwang and Chang prove that there is at most one member of
$F_{2k}$ in the gap between two consecutive members of $R_{2k}$; they
define a rounding mapping each real number $x$ in $[1/2, 1)$ to the nearest
value in $U_{2k}$; and they give an algorithm, based on computing the
continued fraction expansion of $x$, for computing this rounding.
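The way the leading bit distinguishes the two kinds of mantissa can be seen in a short Python sketch of the $2k$-bit mantissa field. The layout used here, with $q - p$ in the high $k$ bits and $q$ in the low $k$ bits, is an assumption made for illustration; the text above says only that the two integers are stored as $k$-bit values. The function names are hypothetical.

```python
from fractions import Fraction


def encode_mantissa(value: Fraction, k: int, as_ratio: bool) -> int:
    """Pack a mantissa in [1/2, 1) into a 2k-bit field."""
    if not Fraction(1, 2) <= value < 1:
        raise ValueError("mantissa must lie in [1/2, 1)")
    if as_ratio:
        p, q = value.numerator, value.denominator
        if q >= 1 << k:
            raise ValueError("denominator does not fit in k bits")
        return ((q - p) << k) | q       # assumed layout: (q - p), then q
    scaled = value * (1 << (2 * k))
    if scaled.denominator != 1:
        raise ValueError("not a 2k-bit radix fraction")
    return int(scaled)


def decode_mantissa(field: int, k: int) -> Fraction:
    """Recover the mantissa; the leading bit selects the interpretation."""
    if field >> (2 * k - 1):            # leading 1: radix fraction
        return Fraction(field, 1 << (2 * k))
    diff, q = field >> k, field & ((1 << k) - 1)
    return Fraction(q - diff, q)        # leading 0: ratio p/q


if __name__ == "__main__":
    k = 4
    for value, as_ratio in ((Fraction(2, 3), True), (Fraction(5, 8), False)):
        field = encode_mantissa(value, k, as_ratio)
        print(format(field, "08b"), decode_mantissa(field, k))
```

Because $q - p \leq q/2 < 2^{k-1}$, the high bit of the ratio form is always 0, while any radix fraction of at least 1/2 has a high bit of 1, so no separate tag bit is needed.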

They perform every operation as if its result were first computed exactly,
then rounded to the nearest member of $U_{2k}$. Although they do not
mention the possibility, their arithmetic could presumably be extended to
support each of the four rounding modes (to-nearest, upward, downward,
toward-zero) available in IEEE floating-point arithmetic [IEE85].

They compare relative errors in rounding real numbers in the interval
$[1/2, 1)$ to the nearest radix fraction and to the nearest value
representable in their system, i.e., they compare relative errors in
rounding these reals to the nearest values in $R_{2k}$ and in $U_{2k}$.
For a real number $x$ in this interval, let $\rho(x)$ be the
machine-representable value $x$ is rounded to in whichever of $R_{2k}$ and
$U_{2k}$ is currently being considered. Let the relative representation
error for a real number $x \in [1/2, 1)$ be $|(\rho(x) - x)/x|$.

Using uniformly-distributed rationals in the interval $[1/2, 1)$, chosen so
that the difference between successive rationals is smaller than the
smallest gap in $U_{2k}$, Hwang and Chang show that for
$8 \leq k \leq 20$ the average relative representation error when these
rationals are approximated by members of $U_{2k}$ is between 10.2% and
11.4% lower than the average relative representation error when they are
approximated by members of $R_{2k}$. Their representation system thus gives
a 10% improvement in relative accuracy, without using any more bits, over a
version of floating-point that does not have implicitly-normalized
mantissas; a version of floating-point with implicitly-normalized mantissas
would have one more bit of precision, with a roughly 50% improvement in
relative representation error. Their system also makes many exact rational
calculations possible.

Hwang and Chang presume operations in their system will be slower than
those in floating-point arithmetic, but propose an arithmetic processor
with a pipelined design that they hope can alleviate this problem. They
give no performance results or estimates.

If upward and downward rounding were available in their arithmetic, their
representation would only be slightly worse than floating-point with
implicitly-normalized mantissas as a means for representing the endpoints
of intervals. It does not facilitate parallel computation.


5.4  Variable-Length Exponents

Iri and Matsui [MI81] propose a sensible extension of floating-point
arithmetic. Their basic idea is to use binary floating-point numbers with a
variable number of bits in the field determining the exponent. In this way,
in numbers with moderate exponents more bits can be used to make the
mantissa more precise, and in numbers with extreme exponents bits that
would otherwise be used for the mantissa can be used to store the exponent.
The resulting values are thus more accurate, using the same number of bits,
than ordinary floating-point values for numbers of moderate size, and have
such a large range that overflow and underflow are virtually impossible.

Iri and Matsui credit Morris [Mor73] with the idea of using a
variable-length exponent to make more bits of precision available for
moderately-sized numbers; their innovation was using the same concept to
make overflow or underflow virtually impossible.

More specifically, they propose for 64-bit values that the first bit code
a sign and the final 6 bits code an exponent-length of 0 through 57. If the
exponent-length value is $n$, the $n$ last bits before the final 6 bits
code an exponent, including its sign; if $n = 0$ the exponent is 0. The
remaining $64 - 1 - 6 - n$ bits code a binary mantissa; if $n = 57$ the
mantissa is taken to be 1. If $n < 57$, the mantissa is implicitly
normalized to fall in the interval $[1/2, 1)$, so has an implicit leading
bit of 1. Values of the exponent-length field from 58 to 63 code infinite
and “Not a Number” values.
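The level-0 layout just described can be made concrete with a small Python decoder. This is a sketch under stated assumptions, not Iri and Matsui’s specification: the text says the exponent field includes a sign but not how it is coded, so 2’s complement is assumed here, and the function name is hypothetical. Exact rational arithmetic is used, so the sketch is practical only for small exponents.

```python
from fractions import Fraction


def decode_level0(word: int) -> Fraction:
    """Decode a 64-bit level-0 word: sign | mantissa | exponent | length."""
    n = word & 0x3F                         # final 6 bits: exponent length
    if n > 57:
        raise ValueError("lengths 58..63 are reserved for special values")
    sign = -1 if (word >> 63) & 1 else 1    # first bit: sign
    exponent = (word >> 6) & ((1 << n) - 1)
    if n and exponent >> (n - 1):           # ASSUMPTION: 2's complement
        exponent -= 1 << n
    m = 57 - n                              # explicit mantissa bits
    frac = (word >> (6 + n)) & ((1 << m) - 1)
    if n == 57:
        mantissa = Fraction(1)              # per the text: mantissa is 1
    else:                                   # implicit leading 1: [1/2, 1)
        mantissa = Fraction((1 << m) | frac, 1 << (m + 1))
    return sign * mantissa * Fraction(2) ** exponent


if __name__ == "__main__":
    # 3 = 0.75 * 2**2: n = 3, exponent field 010, mantissa bits 1000...0
    three = ((1 << 53) << 9) | (2 << 6) | 3
    print(decode_level0(three), decode_level0(three | (1 << 63)))
    print(decode_level0(0))                 # n = 0: 0.5 * 2**0
```

A value with a short exponent such as 3 keeps 54 explicit mantissa bits in this layout, while a value needing a 20-bit exponent would keep 37.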

Actually, the last paragraph defines only level 0 of the representation Iri
and Matsui propose. They also propose level-1 values in which the mantissa
is taken to be 1 and the value given by the mantissa and exponent fields is
the value of the exponent. For example, except for the level indicator a
value that codes $0.5 \cdot 2^{e}$ at level 0 codes
$1 \cdot 2^{0.5 \cdot 2^{e}}$ at level 1. Iri and Matsui define
higher-level values similarly, so that a level-$n$ value codes the exponent
of a level-$(n - 1)$ value. Values of the exponent length between 58 and 63
could be used to code the level indicators. Though they did not say so, it
is probably true that $x$, $x + 1$ and $x \cdot x$ are indistinguishable
for levels no higher than about 6, so underflow and overflow could not
occur, as it cannot in Olver and Clenshaw’s system to be described in
Section 5.6.

Iri and Matsui did not propose specific encodings of higher-level numbers,
and did not carry out simulations of arithmetic for them as they did for
level 0. We do not believe having higher-level numbers would give any
significant advantage over having level 0 with special codes for
$\pm\infty$. The positive numbers representable by level 0 range over the
extremes

\[ 2^{\pm 2^{56}} = 2^{\pm 72057594037927936}
   \approx 10^{\pm 21691497220794363.8}, \]

which should be more than enough; as we noted in our interim report, there
are estimated to be only about $10^{80}$ nucleons in the known universe
[FH65].

Demmel [Dem87] objects that Iri and Matsui’s representation system makes it
more difficult to write programs that produce results with a guaranteed
maximum relative error when overflow and underflow do not occur. Demmel
also notes, however, that if the hardware performing operations for Iri and
Matsui’s system were modified to maintain a register giving the largest
exponent-length that has arisen in calculations since this register was
last reset under program control, the value in this register could be used
to get a lower bound on the lengths of the mantissas arising in these
calculations. This bound could be used as the fixed mantissa length is
normally used to establish limits on the accuracy of floating-point
results.

We believe that the advantages of variable-length exponents outweigh their
disadvantages for realistic applications of computer arithmetic,
particularly for command and control applications. It is usually more
useful to know a number with less precision than to merely know that it
overflowed or underflowed. We believe the situation is analogous to that of
the partial underflow called for in the IEEE floating-point standard
[IEE85], where some of the mantissa bits are discarded to produce nonzero
numbers with magnitudes that would otherwise be too small to represent.
(We note, however, that partial underflow was one of the most controversial
features of the IEEE standard; cf. [FW79,Coo81,Dem81].) Variable-length
exponents also give greater accuracy for moderately-sized numbers, numbers
that arise most often in typical applications.

Iri  and  Matsui  only  propose  performing  arithmetic  as  if  results  are  first 
calculated  exactly  and  then  rounded  to  the  nearest  representable  values  as  in 
IEEE  round-to-nearest  rounding.  That  is  the  form  of  rounding  used  in  their 
simulations  of  arithmetic  operations  for  their  representation.  There  seems 
to  be  no  reason,  though,  why  their  representation  could  not  be  used  with 
each  of  the  four  rounding  modes  possible  in  IEEE  floating-point  arithmetic 
[IEE85]. 

Iri and Matsui [MI81] give several examples of calculations involving
numbers of widely-varying magnitudes, particularly calculations of
binomial-distribution probabilities of the form

$$\binom{N}{k} p^k q^{N-k}$$

for nonnegative integers N and k such that N, k and N - k are all large,
and reals p and q such that p + q = 1. Simulations of calculations using
their system perform significantly better than do either IBM 360 or
IEEE-standard floating-point arithmetic. Their arithmetic is also not
nearly so vulnerable as the others are to producing inaccurate results
because of underflow or being unable to complete calculations because of
overflow.
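
The difficulty such probabilities pose for ordinary floating-point can be
illustrated with a short Python sketch. It is not taken from Iri and
Matsui's paper: IEEE double precision is assumed, and the function names
and the values N = 4000, k = 2000, p = 0.5 are illustrative choices. The
binomial coefficient overflows and the powers of p and q underflow, even
though the probability itself is of moderate size; evaluating through
logarithms is one conventional workaround.

```python
import math

def binomial_factors(N, k, p):
    """Return the three factors of C(N,k) * p**k * q**(N-k) separately."""
    q = 1.0 - p
    return math.comb(N, k), p ** k, q ** (N - k)

def binomial_prob_log(N, k, p):
    """Evaluate the same probability through logarithms to stay in range."""
    q = 1.0 - p
    log_coeff = (math.lgamma(N + 1) - math.lgamma(k + 1)
                 - math.lgamma(N - k + 1))
    return math.exp(log_coeff + k * math.log(p) + (N - k) * math.log(q))

coeff, pk, qk = binomial_factors(4000, 2000, 0.5)
try:
    float(coeff)              # the exact integer is too large for a double
    coeff_fits = True
except OverflowError:
    coeff_fits = False

print(coeff_fits, pk, qk)     # coefficient overflows; both powers underflow
print(binomial_prob_log(4000, 2000, 0.5))
```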

If  upward  and  downward  rounding  were  available  in  their  arithmetic,  their 
representation  would  be  significantly  better  than  floating-point  as  a  means 
for  representing  the  endpoints  of  intervals.  It  does  not  facilitate  parallel 
computation. 

We propose several extensions of Iri and Matsui's representation in
Chapter 7, including an extension that does facilitate parallel
computation. These extensions use previously described ideas from Matula
and Kornerup, and an idea from Aberth [Abe88] to be described in
Chapter 6.


5.5  Recurring-Digits  Arithmetic


Yoshida [Yos83] proposes a representation system based on extending
“recurring decimals” to arbitrary bases and representing them in computer
calculations by adjoining “length of the recurrent portion of the
mantissa” counts to what would otherwise be ordinary floating-point
values. This representation thus provides an alternative way of exactly
representing simple rationals as well as ordinary floating-point values.

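Generalizing ordinary notation for recurring decimals, and writing
$(d_1 \cdots d_j)_b$ for the integer whose base-b digits are
$d_1, \ldots, d_j$, for base b the value

$$0.d_1 \cdots d_n \overline{d_{n+1} \cdots d_{n+r}}$$

represents the number

$$\frac{(d_1 \cdots d_{n+r})_b - (d_1 \cdots d_n)_b}{b^n \cdot (b^r - 1)}$$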

in the interval [0,1]. In Yoshida’s representation, which she calls
FLP/R*, each value contains a “recurring portion” count, a sign, a
mantissa and an exponent. On a hypothetical computer with base-10
arithmetic and 10 digits in the mantissas of its floating-point values,
for example, the value with recurring portion count 3, positive sign,
mantissa 3456756756 and exponent 4 would represent the number
$0.34\overline{567} \cdot 10^4$. Because the recurring portion of each
value must be on the right end of the mantissa, the mantissa need not be
normalized. On the same hypothetical computer, for example, the number
$0.1\overline{3} = 2/15$ would be represented by the FLP/R* value with
recurring-portion count 1, positive sign, mantissa 0000000013, and
exponent 8.
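
The two examples above can be checked with a small decoder. This is a
sketch of ours in Python, not Yoshida's code: the function name, the
argument layout, and the use of a digit string for the mantissa are
assumptions made for illustration. It applies the usual rule for
converting a recurring expansion to a fraction.

```python
from fractions import Fraction

def flpr_value(mantissa, recur_count, exponent, sign=1, base=10):
    """Exact rational denoted by a hypothetical FLP/R*-style value.

    mantissa    -- digit string; its last recur_count digits recur forever
    recur_count -- length of the recurring portion (0 means none)
    exponent    -- power of the base applied to 0.mantissa...
    """
    total = len(mantissa)
    fixed = total - recur_count            # digits before the recurring part
    whole = int(mantissa, base)
    if recur_count == 0:
        frac = Fraction(whole, base ** total)
    else:
        prefix = int(mantissa[:fixed], base) if fixed > 0 else 0
        frac = Fraction(whole - prefix,
                        base ** fixed * (base ** recur_count - 1))
    return sign * frac * Fraction(base) ** exponent

print(flpr_value("0000000013", 1, 8))      # the 2/15 example
print(flpr_value("3456756756", 3, 4))      # 0.34(567 recurring) * 10**4
```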

Unlike floating-point, there is duplication in the numbers represented by
the mantissas and recurring-portion counts in the FLP/R* system, since
in ordinary recurring decimals equalities such as $0.1 = 0.0\overline{9}$
and $0.34\overline{12} = 0.3412\overline{12}$ occur. As might be expected,
though, and as Yoshida demonstrates with empirical results, the relative
effect of such redundancies on the total number of different numbers
representable by FLP/R* values with fixed-size mantissas becomes
progressively less as the size of the mantissas increases. Also unlike
floating-point, the gaps between representable values with the same
exponent in the FLP/R* system are not uniform.

Arithmetic can be performed in the FLP/R* system by expanding out
the  recurrent  portions,  if  any,  of  the  mantissas  of  the  values  being  combined 
to  a  sufficient  number  of  digits,  performing  the  desired  operations  on  these 
expanded  values  as  in  ordinary  floating-point  arithmetic,  and  then  rounding 
the  computed  values  to  the  nearest  values  in  the  FLP/R*  system.  Yoshida 
gives  the  maximum  number  of  digits  that  each  operand’s  mantissa  must  be 
expanded  to  for  each  of  the  arithmetic  operations;  this  number  of  digits  is 
never  more  than  twice  the  sum  of  the  number  of  digits  in  the  operand’s 
mantissa  and  the  length  of  its  recurrent  portion. 

Since simply incrementing the last digit in a mantissa gives a new value
representing a number at least as large as any number represented by the
original mantissa with some recurring-portion count, the number of
possible FLP/R* values with a fixed number of digits in their mantissas
that must be considered to find the one closest to a computed expanded
result is at most one larger than this fixed number of digits. As an
example, in a base-10 system with 6-digit mantissas the possibilities
that must be considered in rounding the expanded result 485656524232,
written to make the example clearer as 485656 524232, are just

485656000000, 

485656666666, 

485656565656, 

485656656656, 

485656565656, 

485656856568, 

485656485656,  and 
485657000000. 

The  closest  of  these  is  the  next-to-last  one,  so  in  the  FLP/R*  system  the 
expanded  result  would  be  rounded  to  a  value  with  mantissa  485656  and 
recurring-portion  count  6. 
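
The enumeration in this example can be reproduced mechanically. The
Python sketch below is ours, not Yoshida's: the function names are
hypothetical, candidates are compared through expansions truncated to
twice the mantissa length (as in the example), and a carry out of the
top mantissa digit is not handled.

```python
def rounding_candidates(mantissa):
    """List (mantissa, recurring count, expanded value) candidates."""
    m = len(mantissa)
    candidates = []
    for count in range(m + 1):
        if count == 0:
            tail = "0" * m
        else:
            recurring = mantissa[m - count:]     # last `count` digits
            tail = (recurring * m)[:m]           # repeat, keep m digits
        candidates.append((mantissa, count, int(mantissa + tail)))
    # Next larger mantissa with no recurring part (assumes no carry-out).
    bumped = str(int(mantissa) + 1).zfill(m)
    candidates.append((bumped, 0, int(bumped + "0" * m)))
    return candidates

def round_expanded(expanded, m):
    """Pick the candidate closest to a 2m-digit expanded result."""
    target = int(expanded[:2 * m])
    return min(rounding_candidates(expanded[:m]),
               key=lambda c: abs(c[2] - target))

for cand in rounding_candidates("485656"):
    print(cand)
print(round_expanded("485656524232", 6))
```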

Although she only discusses arithmetic performed as if the results were
first computed exactly and then rounded to the nearest representable
value, the same arguments that limit the number of possibilities that
must be considered in rounding the expanded results also limit the number
of possibilities that must be considered in other roundings; her
representation could be used for each of the four roundings possible in
IEEE arithmetic [IEE85].


Yoshida  does  not  give  possible  hardware  implementations  of  the  FLP/R* 
system,  and  does  not  discuss  the  critical  question  of  operation  speed.  She 
produces  her  empirical  results  with  an  FLP/R*  simulator. 

Although she does not propose a specific binary implementation of her
representation system that would make storage-efficiency comparisons
possible, the recurring-portion counts probably cause her representation
to be significantly less efficient in using storage than floating-point;
contrast this with the use of storage in Hwang and Chang’s hybrid
floating-point/fixed-slash described in Section 5.3. It would thus not
serve as well as a means of representing the endpoints of intervals. The
representation also does not facilitate parallel computation.


5.6  Hyper-Exponential  Representations 


Olver and Clenshaw [Olv87] propose a representation system that is closed
under the usual arithmetic operations, so overflow and underflow are not
possible. Actually, they propose two systems, one of level-index and the
other of symmetric level-index arithmetic. The level-index system is a
simpler special case of the symmetric level-index system. The level-index
system is immune from overflow while the symmetric level-index system is
immune from both overflow and underflow.

The level-index system uses the “generalized” exponential and logarithm
functions, $\phi$ and $\psi$, defined for nonnegative real numbers x and
X by

$$\phi(x) = \begin{cases} x & \text{if } 0 \le x < 1, \\ e^{\phi(x-1)} & \text{otherwise;} \end{cases}$$

and

$$\psi(X) = \begin{cases} X & \text{if } 0 \le X < 1, \\ \psi(\log X) + 1 & \text{otherwise.} \end{cases}$$

The level-index system represents the real X as a sign, a level, which is
the integer portion of $\psi(|X|)$, and an index, which is a fixed-point
approximation to the noninteger portion of $\psi(|X|)$.
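
The two definitions translate directly into code. The following Python
sketch is ours rather than Olver and Clenshaw's: it uses double-precision
floats in place of a fixed-point index, and the function names are
hypothetical.

```python
import math

def gen_log(X):
    """Generalized logarithm psi: take logs until the value drops below 1."""
    level = 0
    while X >= 1.0:
        X = math.log(X)
        level += 1
    return level + X

def gen_exp(x):
    """Generalized exponential phi: exponentiate the fraction `level` times."""
    level = math.floor(x)
    value = x - level
    for _ in range(level):
        value = math.exp(value)
    return value

def level_index(X):
    """Split a real into (sign, level, index) as described in the text."""
    s = gen_log(abs(X))
    level = int(s)
    return (-1 if X < 0 else 1), level, s - level

print(level_index(1000.0))          # level 3, index roughly 0.659
print(gen_exp(gen_log(1000.0)))     # round trip, roughly 1000
```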


The symmetric level-index system uses negative index values to denote the
reciprocals of values in the level-index system. Its analogs of the
generalized exponential and logarithm functions, $\Phi$ and $\Psi$, are
defined for real numbers x and positive real numbers X in terms of the
previously-defined functions $\phi$ and $\psi$ by

$$\Phi(x) = \begin{cases} 1/\phi(1 - x) & \text{if } x < 0, \\ \phi(1 + x) & \text{otherwise;} \end{cases}$$

and

$$\Psi(X) = \begin{cases} 1 - \psi(1/X) & \text{if } 0 < X < 1, \\ \psi(X) - 1 & \text{otherwise.} \end{cases}$$

The symmetric level-index system represents values as the level-index
system represents them, with the extension that the level values are
signed integers.
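
As a check on these definitions, the symmetric functions can be built on
top of the unsymmetric ones. This Python sketch is again ours, in
double-precision floats with hypothetical function names; it only
confirms that the two symmetric functions invert each other and that
reciprocals map to negated arguments.

```python
import math

def gen_exp(x):
    """phi from the level-index system (x >= 0)."""
    level = math.floor(x)
    value = x - level
    for _ in range(level):
        value = math.exp(value)
    return value

def gen_log(X):
    """psi from the level-index system (X >= 0)."""
    level = 0
    while X >= 1.0:
        X = math.log(X)
        level += 1
    return level + X

def sym_exp(x):
    """Symmetric analog: negative arguments give reciprocals."""
    return 1.0 / gen_exp(1.0 - x) if x < 0 else gen_exp(1.0 + x)

def sym_log(X):
    """Symmetric analog of the generalized logarithm (X > 0)."""
    return 1.0 - gen_log(1.0 / X) if 0 < X < 1 else gen_log(X) - 1.0

print(sym_exp(-1.0), 1.0 / math.e)   # the two values agree
print(sym_log(sym_exp(-1.5)))        # round trip, roughly -1.5
```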


Olver and Clenshaw define methods for performing addition and
subtraction in the level-index system by induction on the levels of the
values being combined; the induction steps typically involve evaluating
powers of e to previously-determined exponents. Multiplication and
division can be performed similarly, since taking logarithms or
exponentials is equivalent to decrementing or incrementing levels. They
have also developed methods for performing arithmetic operations in the
symmetric level-index system. They acknowledge that their arithmetic
operations can be expected to be significantly slower than the analogous
operations in floating-point.


The arithmetic on the symmetric level-index system is closed if the
index portions of the values represented are stored in, say, fixed-point
binary with a limited number of bits, and if the level portions can be
reasonably large. The reason for this closure is that for large values of
X the levels of the values $\psi(X)$, $\psi(X + X)$ and $\psi(X \cdot X)$
are equal and their indexes are indistinguishable to the fixed number of
bits available to store them. Olver and Clenshaw show, for example, that
for 32-bit fixed-point binary indexes this indistinguishability arises
for levels less than 6.

For a fixed number of bits, the level-index system is least precise for
real numbers 0 < X < 1. The system becomes more precise, temporarily, as
X increases, but its precision eventually decays to the point that X and
$X^2$ are indistinguishable.

The precision of the level-index system with 32-bit indexes compares to
IEEE floating-point arithmetic as follows: For IEEE single-precision
values, the level-index system is up to 32 times more precise for
1 < X < 2^[illegible], is of roughly the same precision for
2^[illegible] < X < 2^[illegible], and is up to 37 times less precise for
2^[illegible] < X < 2^128. For IEEE double-precision values, the
level-index system is up to 256 times more precise for
1 < X < 2^[illegible], is of roughly the same precision for
2^[illegible] < X < 2^[illegible], and is up to 68 times less precise for
2^[illegible] < X < 2^1024. Beyond these ranges, overflow occurs for the
IEEE floating-point values. These precision comparisons are for extreme
cases, though, not for typical ones.

Our  impression  is  that  this  representation  gains  its  advantage  of  freedom 
from  overflow  and  underflow  with  unacceptable  losses  in  computation  speed 
and  increased  complexity.  The  variable-length-exponent  representation  by 
Iri  and  Matsui  described  in  Section  5.4  gains  comparable  immunity  from 
overflow  and  underflow  with  much  simpler  operations.  Also,  as  Demmel 
notes  [Dem87],  any  representation  that  allows  the  precision  of  stored  values 
to  vary  makes  it  more  difficult  to  determine  the  precision  of  final  results, 
but  a  register  storing  the  largest  exponent-length  used  would  have  a  simple 
interpretation  making  such  a  determination  possible;  a  register  storing  the 
largest  or  smallest  level-value  used  would  be  much  less  useful,  since  possible 
values  with  a  given  level  have  such  a  great  range. 

The  representation  is  not  appropriate  as  a  means  for  representing  the 
endpoints  of  intervals,  and  it  does  not  facilitate  parallel  computation. 


5.7  Finite-Segment  p-adic  Representations 

Finite-segment  p-adic  arithmetic,  for  a  prime  p,  is  a  method  of  doing  exact 
rational  arithmetic  on  values  that  can  each  be  stored  in  a  fixed  amount  of 
computer  memory.  Further,  the  operations  in  this  arithmetic  are  similar  to 
those  for  ordinary  floating-point  arithmetic  carried  out  in  base  p. 

For the moment, fix p and a positive integer r and let $m = p^r$. Let N
be the largest integer such that $2N^2 + 1 \le m$. Define subsets of the
rationals by

$$Q = \{a/b : a, b \in \mathbb{Z},\ \gcd(a, b) = 1 \text{ and } \gcd(b, p) = 1\}$$

and

$$F_N = \{a/b \in Q : |a| \le N,\ |b| \le N\}.$$

If a/b and c/d are rationals in Q, call them equivalent if
$ad \equiv bc \pmod{m}$. It is then true that for N as above, each
equivalence class in Q contains at most one member of $F_N$.

Identify the members of the ring Z/mZ with the integers 0 to m - 1,
and identify the operations on this ring with the corresponding
operations modulo m. For any integer b such that gcd(b, p) = 1, let
$b^{-1}$ be the unique integer k from 1 to m - 1 such that
$bk \equiv 1 \pmod{m}$; such an integer always exists because
gcd(b, p) = 1. Define a mapping $\phi$ from Q to Z/mZ by letting $\phi$
map the rational a/b to the modulo-m product of a and $b^{-1}$. This
mapping is a ring homomorphism.
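
The mapping and its homomorphism property can be checked on a small case.
The Python sketch below is ours: the function name `residue` and the
choice p = 5, r = 4 are illustrative, and the modular inverse is obtained
with Python's built-in three-argument `pow` (Python 3.8 or later).

```python
from fractions import Fraction
from math import gcd

def residue(q, p, r):
    """Map a rational a/b with gcd(b, p) = 1 to a * b^(-1) modulo p**r."""
    m = p ** r
    a, b = q.numerator, q.denominator
    if gcd(b, p) != 1:
        raise ValueError("denominator must be relatively prime to p")
    return (a * pow(b, -1, m)) % m

p, r = 5, 4
m = p ** r                                   # 625
x, y = Fraction(2, 3), Fraction(1, 4)

# Sums and products of images agree with images of sums and products.
print(residue(x, p, r), residue(y, p, r))
print((residue(x, p, r) + residue(y, p, r)) % m, residue(x + y, p, r))
print((residue(x, p, r) * residue(y, p, r)) % m, residue(x * y, p, r))
```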

The process of using finite-segment p-adic representations works as
follows: Start with rational numbers for which some computation is
desired, and choose p so that all these rationals are in Q. Choose r
sufficiently large so that for m and N as above, all these rationals are
in $F_N$. Apply the mapping $\phi$ to the rationals, and perform all the
desired sums, products and so on, as the corresponding operations in Z/mZ
using finite Hensel codes, which are finite segments of real numbers’
representations as p-adic values. (Finite Hensel codes, algorithms
defining the mapping $\phi$ and its inverse, and algorithms defining the
arithmetic operations in Z/mZ in terms of these codes are described
completely in [GK84]; the codes, the operations on them, and the
advantages and disadvantages of using them are described informally in
the following paragraphs.) If the result of these computations is the
image under $\phi$ of a member of $F_N$, this rational is the exact
result of the desired computation on the original rational numbers.
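
The last step, recovering a member of $F_N$ from a residue, is not
spelled out here; [GK84] gives the authors' algorithms. As a hedged
illustration only, the Python sketch below uses a standard
extended-Euclidean reconstruction, which may differ from the method in
[GK84]; the function name `to_farey` and the example values are ours.

```python
from fractions import Fraction
from math import gcd, isqrt

def to_farey(c, m, N):
    """Return the a/b with |a|, |b| <= N and a = b*c (mod m), or None."""
    r0, r1 = m, c % m
    t0, t1 = 0, 1
    while r1 > N:                     # invariant: r_i = t_i * c (mod m)
        quotient = r0 // r1
        r0, r1 = r1, r0 - quotient * r1
        t0, t1 = t1, t0 - quotient * t1
    a, b = r1, t1
    if b < 0:
        a, b = -a, -b
    if b == 0 or b > N or gcd(a, b) != 1:
        return None                   # no member of F_N in this class
    return Fraction(a, b)

p, r = 5, 4
m = p ** r                            # 625
N = isqrt((m - 1) // 2)               # largest N with 2*N*N + 1 <= m

# 2/3 + 1/4 carried out entirely on residues, then mapped back.
s = (2 * pow(3, -1, m) + 1 * pow(4, -1, m)) % m
print(N, s, to_farey(s, m, N))        # expect the exact sum 11/12
```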

The finite Hensel codes for particular values of p and r as above are
like ordinary base-p floating-point values with r-digit mantissas. The
arithmetic operations on these finite Hensel codes are similar to those
of ordinary base-p floating-point arithmetic, with the exception that
carries are carried to the right rather than the left; this corresponds
to the condition that two reals are close in the p-adic metric if and
only if their difference is a rational whose reduced form has a numerator
divisible by a large power of p. As a consequence, the successive digits
in a finite Hensel code result can be calculated from left to right, one
at a time, with no possibility that later calculations will change these
digits.
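
The left-to-right behavior can be seen by writing residues as base-p
digit lists, lowest-order digit first. The Python sketch below is a
simplification of ours: it ignores the exponent part that full Hensel
codes in [GK84] carry, and the function names are hypothetical.

```python
def hensel_digits(a, b, p, r):
    """First r p-adic digits of a/b (gcd(b, p) = 1), leftmost first."""
    m = p ** r
    c = (a * pow(b, -1, m)) % m
    digits = []
    for _ in range(r):
        c, d = divmod(c, p)           # peel off the lowest-order digit
        digits.append(d)
    return digits

def add_codes(x, y, p):
    """Add two digit lists; the carry moves to the right and is dropped
    after the last digit."""
    out, carry = [], 0
    for dx, dy in zip(x, y):
        carry, d = divmod(dx + dy + carry, p)
        out.append(d)
    return out

print(hensel_digits(2, 3, 5, 4))      # digits of 2/3 for p = 5, r = 4
print(hensel_digits(2, 3, 5, 6))      # a larger r only appends digits
print(add_codes(hensel_digits(2, 3, 5, 4), hensel_digits(1, 4, 5, 4), 5))
```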

Unfortunately, $F_N$ is not closed under the operations of addition,
subtraction, multiplication and division. This gives rise to the
phenomenon called pseudo-overflow, which makes it impossible to associate
a unique member of $F_N$ with the result of a finite-segment p-adic
calculation. In such a situation, all that is known about the desired
exact rational answer is its equivalence class in Q. If machine integers
of a fixed size are used to represent the powers of p in the finite
Hensel codes, ordinary integer overflow is also possible.

If pseudo-overflow occurs, it is possible to simply repeat the
calculations with a larger value of r. Since the computations of earlier
digits are never changed by the computations of later ones, no results
already computed have to be recomputed. Increasing r until
pseudo-overflow does not occur is thus analogous to demanding more digits
or more continued-fraction partial quotients until a desired degree of
accuracy is obtained. One could also do similar calculations with another
prime p' and another power s. Since powers of p and p' are relatively
prime, the separate calculations with p and r and with p' and s are
accurate for all the order-N' Farey fractions for any N' such that
$2(N')^2 + 1 \le p^r \cdot (p')^s$. The two calculations together are
more likely to result in pairs of Hensel codes that have a unique
interpretation.

Finite-segment  p-adic  arithmetic  provides  an  efficient  way  of  performing 
exact  rational  arithmetic  when  pseudo-overflow  does  not  occur.  It  also  gives 
an  efficient  way  of  representing  rationals,  so  that  for  reasonably  small  values 
of  p  and  r  the  family  of  representable  rationals  includes  most  rationals  that 
are  likely  to  arise  in  computations. 

Finite-segment p-adic arithmetic has the major defect, though, that it
does not provide a natural means for discarding information when
pseudo-overflow occurs. The p-adic metric is drastically different from
the normal one, so each rational in Q has both rationals very close to it
and rationals very far away from it in its equivalence class.

This  representation  is  not  appropriate  as  a  means  for  representing  the 
endpoints  of  intervals,  and  does  not  facilitate  parallel  computation. 


Chapter  6 

Constructive  Reals 


Several of the representation systems that we discussed in our interim
report [ORA88] were based on the constructive reals. This chapter
describes our work in this area. The chapter begins with a general
definition of the constructive reals and brief comments on their
properties. It describes basic advantages and disadvantages of
constructive-real arithmetic, and lists implementations of this
arithmetic, including implementations based on continued fractions and
convergent sequences of rationals. The chapter then presents the results
of our work with continued fractions and gives more information on
Boehm’s implementation of constructive-real arithmetic, an implementation
based on convergent sequences of rationals.


6.1  Definitions  and  Basic  Properties 

A constructive, or recursive, real is a real number for which there
exists a finite algorithm capable of generating arbitrarily-accurate
rational approximations to this real. We will follow Boehm [Boe87] in
calling these reals constructive, though they have been called recursive
by recursion theorists [Rog67]. The constructive reals include all
rational numbers, all algebraic numbers, and an infinite number of
transcendental numbers, including e and π. Intuitively, every real number
for which a method exists for computing that number arbitrarily
accurately is a constructive real, so if the ideal inputs to ideal
computer calculations were known arbitrarily accurately then all the
numbers arising in these calculations would be constructive reals.

There do not exist valid general algorithms for deciding whether two
constructive reals are equal or whether one is larger than the other, or
for deciding whether a constructive real is 0 or a rational [Rog67].
Ordinary floating-point arithmetic does “better” in being able to decide
equality and order, and in being able to determine that a number is 0,
only because it represents only a finite number of reals. If two
constructive reals are considered equivalent if they are closer together
than a user-supplied tolerance, then equality, order, and being different
from 0 are all decidable for the constructive reals.

Systems implementing constructive-real arithmetic take algorithms
computing particular constructive reals and create, under user control,
algorithms for computing sums, differences, products, quotients,
logarithms, exponentials, etc., of these reals. The user inputs reals
that the systems take to be exact, and they execute the algorithms they
create for producing sums, products, quotients, etc., to produce
numerical results to user-specified degrees of precision. In actual
implementations, of course, the amount of precision obtainable is limited
by available computer memory and the time needed to compute the results.
In Aberth’s [Abe88] implementation of an arithmetic that allows the user
to set the precision, for example, at most roughly 120 decimal digits of
precision are available.

The  constructive  reals  are  strongly  related  to  the  notion  of  on-demand  or 
data-driven  precision.  In  a  system  providing  on-demand  precision,  the  user 
specifies  a  tolerance  for  relative  error  in  the  final  results  of  a  calculation.  The 
system  then  carries  out  the  intermediate  calculations  to  whatever  degree  of 
precision  is  necessary  to  give  final  results  that  are  guaranteed  to  be  as  precise 
as  the  user  has  specified.  The  degree  of  precision  necessary  for  intermediate 
results can vary with the numbers that actually arise; in evaluating
1/(1 - x), for example, x must be evaluated more precisely if it is
approximately equal to 1.
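
The 1/(1 - x) example can be made concrete with a small experiment. The
Python sketch below is an illustration of ours, not code from any of the
systems discussed: `make_oracle` is a hypothetical stand-in for a source
of x that can be asked for k fractional bits, and the loop simply asks
for more bits until an interval enclosure meets the requested relative
tolerance.

```python
from fractions import Fraction

def make_oracle(x):
    """Hypothetical on-demand source of x: x rounded to k fractional bits,
    so the error is at most 2**-(k+1)."""
    def approx(k):
        return Fraction(round(x * 2 ** k), 2 ** k)
    return approx

def reciprocal_of_one_minus(approx, rel_tol):
    """Return (y, k): y within rel_tol (relative) of 1/(1 - x), and the
    number of bits k of x that was needed. Loops forever if x == 1."""
    k = 1
    while True:
        a, err = approx(k), Fraction(1, 2 ** k)
        lo, hi = 1 - a - err, 1 - a + err        # encloses 1 - x
        if lo > 0 or hi < 0:                     # enclosure excludes zero
            y_lo, y_hi = sorted((1 / lo, 1 / hi))
            if y_hi - y_lo <= 2 * rel_tol * min(abs(y_lo), abs(y_hi)):
                return (y_lo + y_hi) / 2, k
        k += 1

tol = Fraction(1, 1000)
y1, k1 = reciprocal_of_one_minus(make_oracle(Fraction(1, 2)), tol)
y2, k2 = reciprocal_of_one_minus(make_oracle(1 - Fraction(1, 10 ** 6)), tol)
print(float(y1), k1)    # x = 1/2 needs few bits
print(float(y2), k2)    # x close to 1 needs many more
```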


6.2  Advantages  and  Disadvantages 


The  basic  advantage  of  implementations  of  constructive-real  arithmetic  is 
that  they  produce  results  that  are  practically  arbitrarily  accurate.  These 
results  do  not  depend  on  anomalies  of  a  particular  form  of  finite  arithmetic, 
such  as  floating-point,  and  are  never  misleading  because  information  that 
later  turned  out  to  be  significant  was  thrown  away. 

Systems  providing  “on-demand”  precision  can  be  useful  even  if  there  are 
upper  limits  on  the  amount  of  precision  the  user  can  specify  for  final  results 
or  that  the  system  can  take  for  intermediate  ones.  Such  systems  do  not 
perform  constructive-real  arithmetic,  but  can  be  useful  in  situations  where 
the  amount  of  precision  needed  is  variable  but  typically  low;  they  can  save 
computation  time  by  not  calculating  any  intermediate  result  more  precisely 
than  they  have  to. 

We noted in our interim report, though, that since constructive-real
systems do not discard information the amounts of time and space they use
can easily become excessive. In situations where high precision is
needed, the systems can fail to provide this precision within the
available space and time. In more typical situations where high precision
is not needed, the systems’ memory-management overhead for dynamically
making space available for results can slow things down significantly
[KL85]. More importantly, in calculations with large numbers of
intermediate results, particularly calculations involving loops, it can
be impossible for these systems to store all the algorithms they create
for generating arbitrarily-accurate approximations to these intermediate
results.

In calculations involving loops, a constructive-real system can compute
intervals containing intermediate results to a high degree of accuracy,
then recompute these intervals to a higher degree of accuracy if they do
not give final intervals sufficiently short to guarantee the
user-specified degree of precision. (Cf. Aberth [Abe88].) Such a process
can be expected to be slow, though, and can easily be unacceptable in
situations requiring real-time results.

On the issue of possibly using constructive-real systems for real-time
applications, Hans-Juergen Boehm, one of the principal developers of the
constructive real system described in Sections 6.3 and 6.5, advised us
[Boe89]:

    My view is that none of this is presently suitable for real-time
    applications. It is probably too slow at present, and too hard to
    guarantee response time.

Boehm raises the critical point that if additional computation is done in
a data-dependent fashion to maintain a desired degree of precision, this
introduces a possibly unacceptable dependence of the response times on
the input data.

In our interim report, we planned to seek practical situations where only
a limited amount of precision is needed in results that can be taken as
final, even if they are used as inputs to further calculations, but where
the precision needed in intermediate results to obtain this final
precision is unpredictable. Finding the determinant of a large matrix to
a moderate degree of precision is an example, since the matrix can be
singular or nearly singular. We speculated that constructive-real systems
might be useful in such situations, even for real-time applications.

We have since learned, though, that having intermediate results be more
precise than initial inputs is generally only useful for avoiding the
introduction of calculation error in computations with large numbers of
intermediate results. The determinant of a nearly-singular matrix shows
the problem: Although any matrix has an exact determinant, computing this
determinant to a high precision implicitly assumes that the inputs
defining the matrix are exact, an assumption that is usually not
warranted. A computation with a large number of intermediate results is
exactly the sort of situation where a constructive-real system can be
expected to perform poorly.

Other situations that require unpredictable precision in their
intermediate results also require unpredictable precision in their
inputs, so are unrealistic for real-time applications. Techniques for
representing constructive real numbers and performing operations on them
are thus more likely to be useful for real-time applications as ways of
providing “on-demand” accuracy to reduce the necessary amount of
computation.


6.3  Implementations 


The systems doing constructive-real arithmetic considered in our interim
report included ones representing real numbers as lazily-evaluated lists
of redundant digits [Pix82] or of partial quotients [Jon84]. In lazy
evaluation, a potentially-infinite list is represented as an initial
segment of that list together with an algorithm capable of extending the
list on demand [ASS85]. As in Section 3.4, redundant digits avoid the
difficulty, say, of not knowing whether to output 2 or 3 as the third
digit of an ordinary base-10 number whose calculation begins 8.72999....
A list of partial quotients can give either a number’s standard continued
fraction or one of its generalized continued fractions. As we mentioned
in Section 4.5, and will discuss in Subsection 6.4.2 below, generalized
continued fractions give exactly the same sorts of advantages that
redundant digit-sets give.
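
Lazy evaluation of a list of partial quotients can be imitated with
generators. The Python sketch below is ours and is unrelated to the SASL
and Caliban code discussed in this chapter; it uses the well-known
standard continued fraction of e, [2; 1, 2, 1, 1, 4, 1, 1, 6, ...], and
computes convergents only as far as the consumer asks.

```python
from fractions import Fraction
from itertools import islice

def e_partial_quotients():
    """Endless stream of the standard partial quotients of e."""
    yield 2
    k = 1
    while True:
        yield 1
        yield 2 * k
        yield 1
        k += 1

def convergents(quotients):
    """Lazily turn a stream of partial quotients into convergents."""
    h_prev2, h_prev1 = 0, 1           # numerators   h(-2), h(-1)
    k_prev2, k_prev1 = 1, 0           # denominators k(-2), k(-1)
    for a in quotients:
        h = a * h_prev1 + h_prev2
        k = a * k_prev1 + k_prev2
        yield Fraction(h, k)
        h_prev2, h_prev1 = h_prev1, h
        k_prev2, k_prev1 = k_prev1, k

# Only the first eight terms of the infinite list are ever generated.
first_eight = list(islice(convergents(e_partial_quotients()), 8))
print(first_eight[-1], float(first_eight[-1]))
```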

Our interim report also considered an implementation of constructive-real
arithmetic by Boehm [Boe87] that represents a real x as a function fx
from the integers, where the integer input specifies the required
precision, to the rationals, where the rational returned approximates x
to the required precision. For efficiency, this system represents
rationals as appropriately-scaled integers, uses interval arithmetic in
some of its intermediate calculations, and records the results of earlier
computations so that when it is computing a new approximation to a real
it always has the most precise previously calculated approximation to
that real available. We give additional information on Boehm’s system in
Section 6.5.
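
The idea of a real as a function from a precision to a scaled integer can
be sketched in a few lines. The Python code below is an illustration of
ours and does not reproduce Boehm's Russell implementation: the function
names, the convention that f(n) is within one unit of x * 2**n, and the
simple dictionary cache are all assumptions made for the example.

```python
import math
from fractions import Fraction

# Convention assumed here: a real x is a function f with
# |f(n) - x * 2**n| < 1 for every integer n >= 0.

def constant(q):
    """Exact rational q as a constructive real (rounded scaled value)."""
    q = Fraction(q)
    return lambda n: math.floor(q * 2 ** n + Fraction(1, 2))

def sqrt_of(k):
    """Square root of a nonnegative integer k, via integer square roots."""
    return lambda n: math.isqrt(k * 4 ** n)

def add(f, g):
    """Sum of two reals; asks the operands for two extra bits."""
    cache = {}                        # remember earlier approximations
    def h(n):
        if n not in cache:
            # Operand errors total < 2 units at scale n+2; dividing by 4
            # with rounding keeps the result within 1 unit at scale n.
            cache[n] = (f(n + 2) + g(n + 2) + 2) >> 2
        return cache[n]
    return h

x = add(sqrt_of(2), constant(Fraction(1, 3)))
print(Fraction(x(40), 2 ** 40), float(Fraction(x(40), 2 ** 40)))
```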

Jones [Jon84] has implemented a version of Gosper’s algorithm in SASL
[Tur76], a functional programming language strongly and historically
related to the Caliban language used in the Clio prover [BMS89], the
theorem prover used by the Reals project. SASL treats every object as a
function, so functions can be passed as arguments to functions and
returned as values of functions. This makes it natural for implementing a
version of Gosper’s algorithm that takes functions (i.e., infinite lists
of partial quotients) as arguments and returns them as values.

Boehm’s system is implemented in Russell [BDD80], a strongly-typed
functional language. Like SASL, Russell treats functions as first-class
objects that can be passed as arguments to, and returned as values by,
functions. It is also natural for implementing an arithmetic that treats
real numbers as functions, in this case functions from integers
specifying desired precisions to integers specifying sufficiently-precise
rational approximations.

In  principle,  the  LCF  representation,  particularly  with  the  extensions  to 
redundant  bit-sets  mentioned  in  Subsection  5.2.4,  could  be  used  in  a  system 
of  constructive-real  arithmetic.  Matula  and  Kornerup’s  primary  interest  in 
this  direction,  though,  seems  to  be  in  using  LCF  in  “on-demand  precision” 
arithmetic  systems  that  minimize  necessary  computation. 

Chapter 8 lists references to iterated-interval arithmetic systems. These
also implement constructive-real arithmetic or “on-demand precision”
approximations to it.


6.4  Continued  Fraction  Results 

We implemented code for generating standard and optimal continued
fractions in both Caliban and C. We also implemented C code carrying out
Gosper’s algorithm to take standard or generalized continued fractions as
inputs and produce standard or approximately-optimal generalized
continued fractions as outputs.

6.4.1  Caliban  Continued  Fractions 

Appendix H contains the Caliban program we implemented for finding the
standard and optimal continued fractions for rationals selected by user input.
The user executes this program in the Clio prover by simplifying the
expression getcfrac n, where n is a positive integer less than 134217728, which
is the maximum value of one of Caliban’s NUMs. (If the user wants the program
to terminate in reasonably short order, he or she should use a much smaller
value of n.) The Clio prover simplifies the expression to a list containing all
the fractions t/n for n < t < 2n, given as numerator/denominator pairs, and
their standard and optimal continued fraction expansions.
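
As a rough illustration of the kind of calculation described above, the
following C sketch expands a rational p/q into its standard continued fraction
and into a nearest-integer continued fraction. Treating the nearest-integer
expansion as the “optimal” one is an assumption made here for illustration
only; the names floor_div, nearest_div and cf_expand are hypothetical and are
not taken from Appendix H.

```c
#include <assert.h>
#include <stddef.h>

/* floor(p / q) for q != 0; C's own integer division truncates toward zero. */
static long floor_div(long p, long q)
{
    long d = p / q;
    if (p % q != 0 && ((p < 0) != (q < 0)))
        d--;
    return d;
}

/* floor(p/q + 1/2): the integer nearest p/q, with ties rounded upward. */
static long nearest_div(long p, long q)
{
    return floor_div(2 * p + q, 2 * q);
}

/*
 * Write the partial quotients of p/q into out[] (at most max of them) and
 * return how many were written.  nearest == 0 gives the standard expansion;
 * nearest != 0 gives the nearest-integer expansion, which this sketch
 * ASSUMES corresponds to the "optimal" expansion.
 */
static size_t cf_expand(long p, long q, int nearest, long *out, size_t max)
{
    size_t k = 0;

    assert(q != 0);
    while (q != 0 && k < max) {
        long a = nearest ? nearest_div(p, q) : floor_div(p, q);
        long r = p - a * q;     /* p/q = a + r/q, so the tail is q/r */
        out[k++] = a;
        p = q;
        q = r;
    }
    return k;
}
```

For 13/8 this sketch gives [1, 1, 1, 1, 2] in standard form and [2, -3, 3] in
nearest-integer form.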

Every operation used in this program is completely defined within
Appendix H except for the arithmetic operations on NUMs. Definitions of the
natural numbers, the integers, the rationals, the usual arithmetic operations
on these numbers, the order and equality relations on these numbers,
and the absolute-value function on these numbers, are contained in the file
arith.def, which is included in Appendix H.

We were originally interested in using Caliban to implement Gosper’s
algorithm because we had already obtained Jones’ [Jon84] implementation
of this algorithm in SASL. A language that treats functions as first-class
objects is also natural for constructive-real arithmetic. The process of
simplifying Caliban expressions is much slower than the process of executing C
code, though, and Caliban does not contain the convenient I/O statements
available in C, so we decided to use C rather than Caliban for our further
investigations into continued fractions and Gosper’s algorithm.

We implemented a C program carrying out the same sorts of calculations
as those performed by the Caliban program in Appendix H, and in addition
suppressing output for fractions whose standard and optimal continued
fraction expansions are the same, computing convergents for both standard and
optimal continued fractions, and maintaining counts of the numbers of
partial quotients in both forms of continued fractions. This program led us to
observe some of the properties of continued fractions and their convergents
noted in Chapter 4, but we have not included it because we do not believe
it is of further interest.

6.4.2 Gosper’s Algorithm

Sections 1 and 2 of Appendix I contain our C implementations of Gosper’s
algorithm for standard and generalized continued fractions. These programs
only compute combinations of two specific infinite continued fractions, and
must be modified and recompiled to compute combinations of other continued
fractions. As they are given, the programs only compute combinations of
$1 + \sqrt{2}$, which has a particularly simple standard continued fraction
expansion that is also optimal, with itself:

$1 + \sqrt{2} = [2, 2, 2, 2, \ldots] \approx 2.414213562373095.$

The parts of these programs that must be changed to compute combinations
of other continued fractions are very specific, though: The functions
x and y correspond to the inputs x and y of Gosper’s algorithm. These
functions take no arguments; the ith time x or y is called it returns the ith
partial quotient of x or y. When the functions x and y perform in this way,
the C code implementing Gosper’s algorithm is similar to functional
programming language code that treats x and y as lazily-evaluated infinite lists.
The functions rx and ry take no arguments and return double-precision
approximations to x and y, respectively; they are used for producing descriptive
output.

An easy way to have different calls to a C function with no arguments
return different values is to have a static counter local to the function
distinguish the calls. Section 3 of Appendix I contains such a function that
computes the standard continued fraction for e, the base of the natural
logarithms.
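
A minimal sketch of the static-counter idiom follows. It is not the Appendix I
code; the names e_next and cf_value are hypothetical, and the sketch relies only
on the well-known pattern e = [2, 1, 2, 1, 1, 4, 1, 1, 6, ...].

```c
#include <assert.h>

/*
 * The i-th call (counting from 0) returns the i-th partial quotient of
 * e = [2, 1, 2, 1, 1, 4, 1, 1, 6, ...].  The static counter is what lets
 * a function with no arguments return a different value on each call.
 */
static double e_next(void)
{
    static long calls = 0;          /* persists between calls */
    long i = calls++;

    if (i == 0)
        return 2.0;
    if (i % 3 == 2)
        return 2.0 * (double)((i + 1) / 3);     /* 2, 4, 6, ... */
    return 1.0;
}

/* Evaluate the finite continued fraction [a[0], ..., a[n-1]] from the back. */
static double cf_value(const double *a, int n)
{
    double v;

    assert(n > 0);
    v = a[n - 1];
    while (--n > 0)
        v = a[n - 1] + 1.0 / v;
    return v;
}
```

Collecting the first twenty values of e_next() into an array and passing it to
cf_value gives a double-precision approximation to e. Partial quotients are
returned as doubles to match the programs’ handling of the coefficient cube.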

Both programs use double-precision values to store the entries of the
coefficient cube for two reasons:

1. Double-precision floating-point values, at least on most computers, can
exactly represent larger integers than integer values can; and

2. In IEEE arithmetic, the hardware maintains a status flag that records
whether the results of floating-point computations are exact or approximate,
making it possible to detect when coefficient-cube updates are
approximate. (Aberth [Abe88] seemed not to be aware of this possibility,
but he was working on IBM machines on which it does not
exist.)

The  programs  take  the  partial  quotients  of  the  inputs  and  the  output  to 
be  in  double-precision  in  order  to  be  consistent  with  their  handling  of  the 
coefficient  cube. 

Both programs assume that their inputs are infinite continued fractions,
and make no attempt to handle inputs that can be exhausted. They could
presumably be rewritten easily to test for the IEEE $+\infty$ value as a partial
quotient and use it to mark the end of an input (cf. the discussion of
treating finite continued fractions as having $+\infty$ as their final partial
quotient in Subsection 5.2.1). Instead, both programs ingest partial quotients
of their inputs and output partial quotients of their outputs for as long as
their coefficient-cube updates are exact.

Both programs use floating-point arithmetic to decide whether output is
possible or from which of x or y to ingest another partial quotient; since
this arithmetic is approximate, their decisions can be incorrect, particularly
after enough terms have been ingested to make the distinctions between
the alternatives small. Neither program attempts to determine whether its
output should terminate, but this is appropriate since it is not generally
possible to determine whether a constructive real is rational.

The standard continued fraction program implements Matula and
Kornerup’s [KM88] version of Gosper’s algorithm. In particular, it maintains a
decision cube to give the four extreme values $z(1,1)$, $z(1,\infty)$,
$z(\infty,1)$ and $z(\infty,\infty)$, and ingests a partial quotient from x if
the integer parts of its floating-point estimates of $z(1,1)$ and $z(\infty,1)$
are not equal. If these two integer parts are equal, it ingests a partial
quotient from y if the integer part of either of its estimates of $z(1,\infty)$
or $z(\infty,\infty)$ differs from the other three integer parts. If all four
integer parts are equal, it outputs the common integer part as the next
partial quotient of z.

The generalized continued fraction program does not use a decision cube
and does not attempt to minimize the computations it performs to decide
whether to produce output or how to ingest another partial quotient. It
ingests a partial quotient from whichever of inputs x or y causes the greatest
variation in its directly-computed floating-point estimates of $z(x, y)$ when
that input is replaced by the extreme values $-\infty$, $-1$, $1$, $+\infty$ and
the other input is held constant at one of these four extremes. Actually, it
considers only nine cases instead of sixteen, because
$z(-\infty, y) = z(+\infty, y)$ and $z(x, -\infty) = z(x, +\infty)$. The program
outputs a partial quotient of z when all its estimates of z at the extremes of
x and y are within 1/2 of each other; if it outputs a partial quotient, it takes
this partial quotient to be the integer nearest the average of the largest and
smallest of its estimates of z at these extremes. The program assumes that the
possible values of z are bounded by its values at the extremes of x and y; we
did not check this assumption carefully, even for initializations of the
coefficient cube that perform the four basic operations of arithmetic.

If it were not for possible error in the estimates of z, having the extreme
values of z be within 1 of each other would guarantee that there is an integer
that is within 1 of all the possible values of z. By using the tighter bound of
1/2, and by taking the partial quotient it outputs to be the integer nearest
the average of the largest and smallest extreme values, the program chooses
a partial quotient that is within 1 of all the extremes of z even with error in
its estimates of these extremes. Further, this partial quotient will typically
be within 1/2 of all the possible values of z.

The program thus computes an approximation to an optimal continued
fraction. The program only assumes that its inputs are generalized continued
fractions; if it assumed its inputs were optimal, it could evaluate values of z
at the extremes $\pm\infty$ and $\pm 2$ instead of $\pm\infty$ and $\pm 1$.

Even though the inputs x and y are fixed, the user can compute many
different combinations of these inputs by varying the initial entries of the
coefficient cube. The following remarks are based on computations for
$x = y = 1 + \sqrt{2}$ of $x \cdot y$, $x + y$, $x - y$ and $x/y$.

When the result is rational, as in $x - y = 0$ or $x/y = 1$, the standard
continued fraction program “hangs”, falling victim to the main “catch” in
Gosper’s algorithm. It outputs no partial quotients at all. The generalized
continued fraction program produces the correct first partial quotients of 0
or 1, then produces no more partial quotients until it terminates because
of an inaccurate coefficient-cube update. After it outputs the first partial
quotient, giving a perfectly correct “approximation” to the final result, the
possible extreme values of z computed by the generalized continued fraction
program vary over larger and larger extremes as the program ingests more
partial quotients from x and y. This is reasonable, since either $+\infty$ or
$-\infty$ as the next partial quotient would be correct.

For $x \cdot y = 3 + 2\sqrt{2}$, the standard continued fraction program
outputs $[5, 1, 4, 1, 4, 1, 4, \ldots]$ while the generalized continued fraction
program outputs $[6, -6, 6, -6, 6, -6, \ldots]$. The generalized continued
fraction program’s output converges, on a partial quotient per partial quotient
basis, to its ideal value more quickly than does the standard continued
fraction program’s output, but otherwise their results are similar. At the point
that floating-point approximations to the convergents of both outputs and to
the ideal answer become identical, the standard and generalized continued
fractions contain 21 and 11 partial quotients, respectively; the standard
continued fraction has a pair of partial quotients 1 and 4 for every generalized
continued fraction partial quotient of 6 or -6. The outputs for
$x + y = 2 + 2\sqrt{2}$ are similar.
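
The relative convergence rates of the two expansions can be checked with a
short C sketch that evaluates truncated continued fractions and compares them
with a double-precision value of $3 + 2\sqrt{2}$. The function names are
hypothetical, and the sketch evaluates finite truncations from the back rather
than forming convergents incrementally as the programs’ output does.

```c
#include <assert.h>
#include <math.h>

/* Value of the finite continued fraction [a[0], a[1], ..., a[n-1]]. */
static double cf_value(const double *a, int n)
{
    double v;

    assert(n > 0);
    v = a[n - 1];
    while (--n > 0)
        v = a[n - 1] + 1.0 / v;
    return v;
}

/* Fill a[0..n-1] with first, followed by the repeating pair p, q, p, q, ... */
static void fill_periodic(double *a, int n, double first, double p, double q)
{
    int i;

    assert(n > 0);
    a[0] = first;
    for (i = 1; i < n; i++)
        a[i] = (i % 2 == 1) ? p : q;
}

/* Absolute error of the n-term truncation relative to 3 + 2*sqrt(2). */
static double product_error(int n, double first, double p, double q)
{
    double a[64];

    assert(n > 0 && n <= 64);
    fill_periodic(a, n, first, p, q);
    return fabs(cf_value(a, n) - (3.0 + 2.0 * sqrt(2.0)));
}
```

Here product_error(n, 5.0, 1.0, 4.0) measures the standard expansion and
product_error(n, 6.0, -6.0, 6.0) the generalized one. Each generalized partial
quotient does the work of a pair of standard ones, so the generalized error at
n terms is comparable to the standard error at about 2n - 1 terms.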

These results were contrary, at least in practical terms, to our
expectation that the more-uniform throughput possible for generalized continued
fractions would reduce the magnitudes of the coefficient-cube entries
arising in the calculations. Coefficient-cube entries with similar magnitudes
are needed by the two programs to produce outputs with a given precision. In
the $x \cdot y = 3 + 2\sqrt{2}$ case, the largest magnitudes of the
coefficient-cube entries for the two programs at the times when floating-point
approximations to the convergents first equal floating-point approximations to
the ideal answers are identical. In particular, in both programs the
coefficient-cube entries typically grow without bound as the programs ingest
and output more and more partial quotients.

The numbers of partial quotients the two programs must ingest to produce
outputs with a given precision are roughly equal. The generalized continued
fraction program must occasionally ingest a few more partial quotients
because it must consider more possibilities in bounding the values of z.

It is noteworthy, though, that for both programs the partial quotients
converge to the best possible value representable in double-precision long
before the coefficient-cube entries become too large to represent exactly in
double-precision. This is presumably related to why the Matula and
Ferguson [FM85] results described in Subsection 5.1.4 on using simulated mediant
rounding to find the inverses of Hilbert matrices are so much better than the
floating-point results for these same calculations, even though the simulation
uses exactly the same floating-point hardware.


6.5  Boehm’s  Constructive  Reals 

Boehm’s [Boe87] constructive-real system comes in an arbitrary-precision
“desk  calculator”  that  runs  on  Sun  workstations.  The  calculator  is  able  to 
produce  results  whose  precisions  are  limited  only  by  available  computer  mem¬ 
ory  and  user  patience.  The  amount  of  patience  it  requires  is  very  reasonable; 
on  a  Sun  3/60,  we  used  the  calculator  to  compute  e®"  to  1028  decimal  places 
in  less  than  149  seconds. 

We originally planned to test the time and space requirements of Boehm’s
system on a problem with a large number of intermediate results by finding
the determinant of a large matrix, but we were unable to do so before we
were asked to finish Task 5. Such a calculation is difficult to do with the
calculator because the calculator only performs operations in response to
user input; it is not programmable. The following paragraphs summarize the
information we gathered showing how Boehm’s system could be used for the
sort of problem we intended to test it on, and give recent results that Boehm
gave us on the system’s speed. We have already noted Boehm’s opinion on
the underlying issue of whether constructive-real representations might be
useful for real-time command and control applications.

Boehm and Vernon Lee have developed a matrix-arithmetic package
written in Russell that can be used with the constructive-real package. It does
not include taking determinants, but does include a Gaussian elimination
routine that could be used as an example of a problem of similar complexity.
In addition, recent versions of the Russell compiler include source for versions
of the constructive-real package that can be called from C code.

Boehm gave timing results for problems requiring on the order of 100
intermediate results in [Boe87]. For example, on a Sun 3/260 with a Motorola
68881 floating-point coprocessor, with sufficient heap space already allocated
in computer memory, the constructive-real package took about 183 seconds
to compute 100! by taking the natural logarithm of each of the numbers from
1 to 100, adding these logarithms, and finding the exponential of the result.
Current speed numbers for the package [Boe89] are about 20% better than
the results reported in [Boe87] because of incremental improvements to the
package and the Russell compiler.

Boehm  seems  to  have  gotten  a  factor  of  10  speed  improvement  with  newer 
implementations  of  the  package  based  on  iterated  interval  arithmetic,  but 
these  implementations  are  not  yet  ready  to  be  distributed  [Boe89].  We  give 
brief comments about iterated interval arithmetic in Chapter 8.


Chapter 7

Conclusions  and  Questions 


This chapter summarizes our conclusions and lists possible questions for
future research. The conclusions are primarily addressed to Task 5’s Air Force
sponsors.


7.1 Conclusions

Although the absolute bounds on error given by interval analysis are
desirable, the bounds given by simply replacing scalar arithmetic operations
with corresponding interval ones are so overly conservative that they usually
do not correctly show the dependence of computed outputs on errors
in inputs and intermediate calculations. Matijasevich [Mat85] has suggested
a technique for efficiently computing partial derivatives that might lead to
a way of limiting this problem, but his technique does not yet adequately
handle programs with loops. Simply replacing scalar arithmetic operations
with corresponding interval ones also does not give a useful extension of the
Reals project’s notion of asymptotic correctness. More sophisticated interval
algorithms can produce surprisingly perfect results, though, and should
be investigated for their possible command and control applications.

The trade-offs involved in making the asymptotic model more and more
realistic are difficult to evaluate since the time, space and circuit-complexity
costs of increasing the precision of floating-point arithmetic are high.
However, the asymptotic model is presumably reasonably accurate, since these
trade-offs were implicitly considered in designing recent forms of
floating-point arithmetic. Adherence to the IEEE floating-point standard seems
to force some operations, particularly division, to be performed more slowly,
and IEEE arithmetic is slightly less precise than VAX arithmetic, for
double-precision values, because it allocates more bits to the exponent.
Versions of floating-point arithmetic with redundant digit sets facilitate
parallel processing by supporting on-line arithmetic.

There are alternative representation systems in the engineering literature
that are good for situations involving rational numbers and/or highly parallel
computations. Command and control applications should be examined for
the presence of such situations to identify cases where these representation
systems might be useful. There is also one alternative representation
system, the variable-length-exponent version of floating-point by Iri and Matsui
[MI81], that we believe is generally preferable to the standard floating-point
representations for command and control applications.

Approximate rational arithmetic provides a means for capturing many of
the advantages of exact rational arithmetic without incurring unacceptable
time and space costs. Mediant rounding, with its bias towards simplicity,
can greatly increase the accuracy of computations on numbers whose ideal
values are rational. Representations based on continued fractions,
particularly redundant generalized continued fractions, treat exact rational and
approximate real computations uniformly, and facilitate parallel processing
by supporting arithmetic operations based on Gosper’s algorithm that can
be carried out on-line and largely in parallel.

The most mathematically interesting representation systems in the
literature are the extended floating-slash [MK85] and binary lexicographic
continued fraction (LCF) [KM88] representations by Matula and Kornerup. These
systems implement approximate rational arithmetic, and experimental
results indicate that floating-slash is much more accurate than floating-point
for computations on quantities whose ideal values are rational and is not
significantly worse than floating-point on other computations. In addition,
the extended floating-slash representation has the same ability as
floating-point to conveniently represent numbers of widely varying magnitudes.
The LCF representation makes it easy to decide which of two numbers is larger,
and also supports bitwise, on-line, variable-accuracy arithmetic. The LCF
representation does not have floating-point’s ability to conveniently
represent numbers of widely varying magnitudes, but a possible extension of this
representation might.

Approximate rational arithmetic is probably most useful in command and
control applications for knowledge-based systems in which it is critical to
recognize simple rationals. It might also be useful in combinatorial
optimization problems involving integer quantities. In situations where it
might be useful, simulating mediant rounding in floating-point might suffice to
capture most of approximate rational arithmetic’s advantages, and would do so
with standard hardware.

The alternative representation system in the literature that we believe
would be most useful for typical command and control applications is the
variable-length-exponent representation by Iri and Matsui [MI81]. This
system is capable of representing quantities of such widely-varying magnitudes
that it practically eliminates overflow and underflow, and it is more accurate
than floating-point for typical, moderately-sized numbers. We believe that
this system should be implemented, as Demmel [Dem87] suggests, with
hardware recording the maximum length of an exponent that has arisen in
calculations. We believe Iri and Matsui’s system, with the extension suggested
by Demmel, is preferable to versions of floating-point that use comparable
numbers of bits.

Other alternatives in the engineering literature, particularly Hwang and
Chang’s [HC78], combine the advantages of floating-point and
approximate-rational representations or give other methods of avoiding overflow
and underflow. We do not believe these representations would be as effective
for these purposes as the floating-slash and variable-length-exponent
representations.

Representations based on the constructive reals, in which quantities can
be calculated to a virtually-arbitrary user-set precision, are unlikely to be
useful for real-time command and control applications. Implementations using
these representations tend to be slow, and they pay for predictable
precision with unpredictable response times. Further, implementations based on
such representations typically assume that inputs are exact, which is
unrealistic for command and control applications. However, these representations
are very useful for interactive analysis, and they suggest the desirability of
representations that support variable-precision arithmetic.

We propose several possible representation systems based on ideas from
Iri, Matsui, Matula, Kornerup and Aberth in Section 7.2. These
representation systems raise questions about continued fractions that we give
in Section 7.3, and the continued fraction questions raise an open mathematical
question about possible representations of integers that we give in Section 7.4.


7.2  Variable-Length-Exponents 

This section lists several possible variants of Iri and Matsui’s [MI81]
variable-length-exponent representation. All these variants are based on using
self-delimiting exponents, as in Matula and Kornerup’s LCF encoding of positive
integers [MK83], rather than a fixed-length field giving exponent lengths.
Some of these variants also perform approximate rational arithmetic, support
variable-precision arithmetic, or facilitate parallel processing.

Our primary interest in Iri and Matsui’s variable-length-exponent
representation is not in its immunity from overflow and underflow, but in its
efficient use of bits. It first takes bits for the exponent, which gives the
number’s general magnitude, and leaves any remaining bits for the mantissa. We
will initially assume that the exponent is for base 2, as it is in [MI81], and we
will ignore negative exponents; it would be easy enough to code numbers
having negative exponents by giving each number two sign bits, one for the
number and another for its exponent. A number’s exponent gives the most
significant information about the number, so it is appropriate that as many
as necessary of whatever bits are available for coding the number be used to
code its exponent.

If the exponents in quantities that arise most often are small, reserving 6
bits to give the length of the exponent, as Iri and Matsui do in [MI81], wastes
bits. In the LCF encoding of positive integers, for example, the integers 1, 2,
3 and 4 are coded by the strings 0, 100, 101 and 11000, respectively. The LCF
encoding only has 6 “wasted” bits (the leading 1’s that give the length of
the integer’s binary value after that value’s leading 1) for integers greater
than 63. Floating-point numbers roughly as large as $2^{64} \approx 10^{19}$
are rare, so using an LCF encoding of the exponent would typically leave more
bits free to be in the mantissa.
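
The four example codes above are consistent with the following reading of the
encoding: for an integer whose binary value has k bits after its leading 1,
write k 1’s, then a 0, then those k bits. The C sketch below implements that
reading; the names lcf_encode and lcf_code_less are hypothetical, and the
sketch should be checked against [MK83] before being relied on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the self-delimiting code of n >= 1 into buf as a string of '0' and
 * '1' characters and return its length in bits: k ones, a zero, then the
 * k bits of n that follow its leading 1.
 */
static size_t lcf_encode(unsigned long n, char *buf, size_t cap)
{
    size_t k = 0, pos = 0, i;
    unsigned long m;

    assert(n >= 1);
    for (m = n; m > 1; m >>= 1)
        k++;                            /* k = floor(log2 n) */
    assert(cap >= 2 * k + 2);           /* 2k + 1 bits plus the terminator */
    for (i = 0; i < k; i++)
        buf[pos++] = '1';               /* unary length prefix */
    buf[pos++] = '0';                   /* end of the prefix */
    for (i = k; i-- > 0; )
        buf[pos++] = ((n >> i) & 1UL) ? '1' : '0';
    buf[pos] = '\0';
    return pos;
}

/* Nonzero if the code of a sorts before the code of b as a bit string. */
static int lcf_code_less(unsigned long a, unsigned long b)
{
    char ca[2 * 64 + 2], cb[2 * 64 + 2];

    lcf_encode(a, ca, sizeof ca);
    lcf_encode(b, cb, sizeof cb);
    return strcmp(ca, cb) < 0;
}
```

Under this reading the code for 64 is 1111110000000, with six leading 1’s, and
comparing two codes as bit strings gives the same answer as comparing the
integers they represent.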

Further, the coded value of an exponent could be shifted by a constant
to minimize the typical lengths of exponents’ LCF encodings. If statistical
studies indicated that the binary exponent occurring most often in typical
calculations was 3, for example, a number’s exponent could be taken to be
an LCF exponent value plus 2.
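
The effect of such a shift on the bits left for the mantissa can be estimated
with a small C sketch. The layout assumed here (one sign bit, then the coded
exponent, then the mantissa) and the names lcf_bits and mantissa_bits are
illustrative assumptions, as is the code length of 2k + 1 bits for an integer
whose binary value has k bits after its leading 1.

```c
#include <assert.h>

/* Bits in the self-delimiting code of n >= 1: 2*floor(log2 n) + 1. */
static int lcf_bits(unsigned long n)
{
    int k = 0;

    assert(n >= 1);
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return 2 * k + 1;
}

/*
 * Mantissa bits left in a word of word_bits bits after one sign bit and the
 * coded exponent, where the coded value is exponent - shift and must be at
 * least 1.  Negative exponents are ignored, as in the text.
 */
static int mantissa_bits(int word_bits, unsigned long exponent,
                         unsigned long shift)
{
    int left;

    assert(exponent > shift);
    left = word_bits - 1 - lcf_bits(exponent - shift);
    return left > 0 ? left : 0;
}
```

In a 64-bit word under these assumptions, an exponent of 3 leaves 60 mantissa
bits without a shift and 62 with a shift of 2, since the coded value 1 needs
only a single bit.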

Also, with self-delimiting exponents the lengths of the bit strings used
to represent numbers need not be fixed. The initial bits of one of these
strings could define a self-delimited exponent and any remaining bits could
then be the mantissa. The precision to which results were calculated could
then be controlled by a bit- or byte-count number-length value maintained
in hardware and subject to program control; Aberth’s [Abe88] PRECISION
variable serves a similar purpose. Special “exponent” values, such as number
fields containing only 1’s, could be used to code infinite and not-a-number
values for each precision, as well as ±0.

As we noted, Iri and Matsui [MI81] only propose performing arithmetic
as if results were first calculated exactly and then rounded to the nearest
representable values. There seems to be no reason, though, why their
representation could not be used with each of the four rounding modes possible
in IEEE floating-point arithmetic [IEE85]. We would do so, and would also
adopt Demmel’s [Dem87] suggestion of maintaining a hardware register to
record the largest exponent-length that has arisen in calculations for the
current precision.

We have so far assumed that exponents are for the base 2 and mantissas
are ordinary base-2 values. It would also be possible, though, to use continued
fractions as mantissas and use LCF encodings of these mantissas. That
would make the system able to exactly represent a great many rationals,
with the number of exactly-representable rationals determined by the current
precision. (Every rational has a finite continued fraction, and a rational is
exactly representable for a given precision if this precision is large enough
to code that continued fraction.) This would also presumably simplify the
hardware needed to perform arithmetical operations on quantities of varying
precisions (i.e., bit lengths) and increase the numbers of operations that could
be done in parallel. If this were done, some base other than 2 might be better
for the exponent, though our initial guess is that it would not be. If this were
done, it might also be useful to allow mediant rounding as a rounding option.
It might also be useful to allow some sort of redundant-binary LCF-like
encoding of the mantissas, as suggested by Matula and Kornerup [KM88].

This  last  proposal  is  essentially  the  same  as  Matula  and  Kornerup’s 
[KM88]  LCF  proposal,  including  its  possible  extension  to  redundantly-coded 
binary  values.  It  adds  a  self-delimited  initial  exponent,  an  explicit  mention 
of  precision  bounds,  a  largest-exponent-length  register,  and  more  possible 
roundings. 

Finally, it might be useful to code mantissas as binary representations
of generalized continued fractions rather than as redundant-binary
representations of standard continued fractions. The issue is whether the more
rapid convergence of optimal or nearly-optimal generalized continued
fractions, with their more uniform throughput for Gosper’s algorithm, is worth
the cost of the bits needed to give the signs of the partial quotients. This
issue is behind some of the questions about continued fractions in Section
7.3.

The main potential difficulties we see with these possible representations
are with operation speed and the size and complexity of the necessary
hardware. These issues also arise in the questions in Section 7.3, and are
behind the theoretical question in Section 7.4.


7.3 Continued Fraction Questions

This section lists questions about using generalized continued fractions and
about using Gosper’s algorithm to do arithmetic on them.

Would it be practical to use Gosper’s algorithm, and initialize the
coefficient cube’s entries appropriately, to get the effect of multiplying
standard or generalized continued fractions by powers of a fixed base? If not,
it would not be practical to give numbers as exponents and continued-fraction
mantissas.

What, if any, relationship is there between using redundant-binary
arithmetic to code the LCF results of having Gosper’s algorithm output standard
continued fractions, as Matula and Kornerup [KM88] suggest, and using
signed partial quotients to code the results of having Gosper’s algorithm
output generalized continued fractions? Would generalized continued fractions
give outputs that are easier for hardware to interpret?

Which values of z in Gosper’s algorithm must be evaluated for generalized
continued fraction inputs in order to definitely bound all possible
values of z? Is it possible to maintain a generalized “decision cube” to give
these bounds?

How rapidly can the entries of the coefficient cube grow for
redundant-binary output, either for standard or generalized continued fractions?


7.4  Integer  Representations 

Our study of Gosper’s algorithm, particularly the possibility suggested by
Matula and Kornerup [KM88] that a redundant-binary encoding of the
algorithm’s outputs might improve its throughput, led us to wonder whether
there might exist an encoding of the partial quotients of standard or
generalized continued fractions that would allow Gosper’s algorithm to produce an
indefinite number of partial quotients even if the magnitudes of its
coefficient-cube entries were bounded. Thinking about a special case of this, the case
in which only one partial quotient is ingested from each of the two inputs,
led us to pose the following question about possible representations of the
integers:

Informally,  the  question  is  whether  there  exists  a  (presumably  redundant) 
encoding  of  the  integers  that  makes  it  possible  for  finite-state  machines  to 
compute  sums  and  products  and  also  to  decide  order.  Let  A*  be  the  set  of 
all  finite  strings  of  elements  of  an  alphabet  A.  To  ask  the  question  about 
integer  encodings  precisely,  do  there  exist  the  following: 

•  A  finite  alphabet  A; 

•  A recursive function f from A* onto the integers;

•  Two restricted Turing machines M1 and M2, both with two one-way
input tapes and one one-way output tape, which compute the functions
g1, g2 : A* × A* → A* respectively; and

•  A restricted Turing machine M3 with two one-way input tapes that
always halts when started with elements of A* on its input tapes and
always halts in an accepting or rejecting state;

having the properties that for all x, y ∈ A*,

1. f(g1(x, y)) = f(x) + f(y),

2. f(g2(x, y)) = f(x) · f(y), and

3. M3 accepts (x, y) if and only if f(x) < f(y)?

Note that the condition required to give an affirmative answer to this
question is very strong. A single machine must be able to multiply arbitrarily
large integers if they are coded appropriately. It does not suffice to have a
single machine unit that can be composed into arrays such that any pair of
integers can be multiplied by one of these arrays, as is done in [Atr65].

When  we  first  posed  a  variant  of  this  question,  we  presumed  that  the 
answer was “no”, and expected to go on to consider how rapidly the number
of  states  the  coefficient-cube  entries  can  assume  increases  as  the  total  number 
of  partial  quotients  Gosper’s  algorithm  ingests  and  outputs  increases.  We 
learned,  though,  that  this  seemingly  simple  question  about  encoding  the 
integers  is  more  subtle  than  we  had  imagined,  and  is  still  open.  Professors 
Hopcroft,  Hartmanis  and  Kozen  of  Cornell  University  advised  us  that  they 
knew  of  no  work  that  answered  this  question,  and  in  a  recent  paper  Regan 
[Reg88]  mentioned  a  ring-theoretic  generalization  of  the  question  as  being 
open. 

For the remainder of this section, we will take “finitely computable” to
mean “computable (for a function) or decidable (for a relation) via a fixed
recursive encoding by a restricted Turing machine with one-way input and
output tapes”. With this terminology, the question we have asked is, “Is
there a possibly-redundant recursive encoding of the integers such that
addition, multiplication and order are all simultaneously finitely computable?”
Replacing “order” by “equality” gives a simple weakening of the desired
conditions, i.e., M3 accepts (x, y) if and only if f(x) = f(y). We do not require
that the encoding f itself be finitely computable, but only that it be
recursive, so we do not impose any time or space constraints on the computations
needed, say, to convert strings in A* to ordinary signed base-10 numbers.

The answer to our question is “yes” if it is weakened:

•  A signed variant of unary notation with a symbol for 0, where the
positive integer n is represented by n consecutive 1’s, makes addition
and order finitely computable.

•  A signed variant of prime-power notation with a symbol for 0, where
the positive integer n is represented by a string of non-negative integers
in unary (with 0 standing for a zero exponent) that give the exponents of
n’s factorization into primes, makes multiplication and equality finitely
computable. For example, if commas separate the exponents and # marks
the end of the number, since the consecutive primes are 2, 3, 5, 7, ...
the encoding of 84 = 2^2 · 3 · 7 is 11,1,0,1#.

•  If prefix-Polish expressions in ordinary base-10 notation are taken to
be encodings of the numbers obtained by evaluating these expressions,
then addition and multiplication are finitely computable. If + and *
denote addition and multiplication respectively, and if commas separate
different expressions, the product of 857 and 973 can be “encoded” as
*857,973.
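
The prime-power case can be sketched in C. This is only an illustration under stated assumptions: the names pp_multiply and copy_field are hypothetical, signs and the integer 0 are not handled, and each input is assumed to be a well-formed, #-terminated string whose fields are either the single character 0 or a run of 1's. The routine reads both inputs strictly left to right and carries only a one-bit flag from one character to the next, which is the sense in which multiplication is finitely computable in this notation.

```c
/* Copy the 1's of one exponent field from *in to *out, advancing both
   pointers past the field.  Returns 1 if at least one 1 was copied. */
static int copy_field(const char **in, char **out)
{
    int any = 0;
    while (**in == '1') {
        *(*out)++ = '1';
        ++*in;
        any = 1;
    }
    if (**in == '0')
        ++*in;              /* a zero exponent contributes no 1's */
    return any;
}

/* Multiply two positive integers given in prime-power notation.
   Exponents add, so the unary fields are simply concatenated.
   out must have room for strlen(a) + strlen(b) + 1 characters. */
void pp_multiply(const char *a, const char *b, char *out)
{
    for (;;) {
        int any = 0;
        if (*a != '#') any |= copy_field(&a, &out);
        if (*b != '#') any |= copy_field(&b, &out);
        if (!any) *out++ = '0';      /* both exponents were zero */
        if (*a == ',') ++a;
        if (*b == ',') ++b;
        if (*a == '#' && *b == '#') break;
        *out++ = ',';
    }
    *out++ = '#';
    *out = '\0';
}
```

For example, multiplying 11,1,0,1# (84) by 1,0,1# (10) with this sketch produces 111,1,1,1#, the encoding of 840 = 2^3 · 3 · 5 · 7.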

Compositions of finitely-computable functions need not be finitely
computable, and the answer to our question is more dependent on exactly how
it is formulated than one might expect. For example, even though letters
from a finite alphabet on two separate input tapes can be coded by a letter
from a larger alphabet on a single input tape, our question cannot necessarily
be adapted to machines with a single input tape. The question, “Are there
twice as many x’s on tape 1 as there are on tape 2?” is easily answerable by
a restricted Turing machine with two separate, independently-controllable
one-way input tapes, but is not answerable by such a machine if the two
input tapes are both coded onto a single one-way tape.
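
A minimal C sketch of the two-tape machine for this question follows. The function name twice_as_many is hypothetical, and NUL-terminated strings stand in for the tapes. Each pointer only moves forward, and the only memory besides the two head positions is a counter that takes the values 0 and 1, so the routine behaves like a finite-state machine with two independently controlled one-way heads.

```c
/* Returns 1 if tape1 holds exactly twice as many 'x' characters as
   tape2, and 0 otherwise.  Characters other than 'x' are skipped. */
int twice_as_many(const char *tape1, const char *tape2)
{
    int k;

    for (;;) {
        /* advance tape 2 to its next x, if any */
        while (*tape2 != '\0' && *tape2 != 'x') ++tape2;
        if (*tape2 == '\0') break;
        ++tape2;

        /* match that x against two x's on tape 1 */
        for (k = 0; k < 2; ++k) {
            while (*tape1 != '\0' && *tape1 != 'x') ++tape1;
            if (*tape1 == '\0') return 0;   /* tape 1 ran out first */
            ++tape1;
        }
    }

    /* tape 2 is exhausted; tape 1 must hold no further x */
    while (*tape1 != '\0') {
        if (*tape1 == 'x') return 0;
        ++tape1;
    }
    return 1;
}
```

The sketch depends on being able to hold one head still while the other advances, which is exactly the freedom that is lost when both tapes are interleaved onto a single one-way tape.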

With a redundant system, a steady output stream of digits also does not
necessarily mean a steady output stream of useful information: in
signed-digit binary, with $\bar{1}$ denoting the digit $-1$,
$1\bar{1}\bar{1}\bar{1}\bar{1}\bar{1}\bar{1} = 0000001$.
Still, with generalized continued fractions and Gosper’s algorithm
it  is  possible  to  steadily  output  partial  quotients  that  all  give  significant 
information  about  the  true  value  of  the  output,  so  this  suggests  that  an 
encoding of the integers making addition, multiplication and order all finitely
computable might have the additional property that each letter of a word
imposes significant bounds on the integer coded by that word.

We  still  expect  the  answer  to  our  question  to  be  “no”,  but  if  there  is 
an  encoding  of  the  integers  that  makes  the  answer  “yes”,  and  if  with  this 
encoding  the  necessary  compositions  of  finitely-computable  functions  needed 
to  carry  out  Gosper’s  algorithm  are  themselves  finitely  computable,  then 
there exists a representation system based on continued fractions for which
a  fixed  hardware  unit  can  perform  operations  on  arbitrarily-precise  inputs 
and  produce  outputs,  possibly  after  an  initial  delay,  as  rapidly  as  it  reads  its 
inputs.  Such  a  representation  system  would  thus  limit  hardware  complexity, 
support  use  of  variable  precision,  and  facilitate  parallel  computation. 


Chapter  8 
Task  Notes 


This  chapter  ties  up  loose  ends  from  Task  5  and  our  interim  report  [ORA88]. 
The chapter notes efforts planned in our interim report that we changed or
were  unable  to  carry  out,  and  corrects  two  errors  in  our  interim  report. 


8.1  Interval  Probabilities 

We noted in our interim report that it would sometimes be very useful if one
could assign probabilities to the distribution of ideal results in the intervals
produced by interval operations. It might be much more valuable to know
that there is a 99.73% chance that a value is between 7.876 and 7.877, for
example, than to know with certainty that this value is between 4.5 and 9.3.
We did not find anything useful when we investigated this possibility.

As we noted in our interim report, if one starts with uniform distributions
and computes the probability distribution of the results of interval operations
accurately, the amount of information that must be stored for an interval can
become arbitrarily large. Further, most of this information is useless. We
thus looked into using approximate probability distributions that can be
parameterized simply. The only well-known distribution besides the uniform
distribution that is parameterized simply and applies to values that range
over an interval is the beta distribution, and it does not describe the results
of interval operations in any natural way.


8.2 “Briggsian” Algorithms


We were unable to perform the work we planned to do on quantitative results
for  interval  versions  of  “Briggsian”  algorithms,  which  perform  the  algebraic 
operations  on  some  pocket  calculators.  We  were  unable  to  find  literature 
on  these  algorithms.  The  name  we  were  given  for  them  might  have  been 
nonstandard,  and  significant  information  about  them  might  be  proprietary. 


8.3  Theoretical  Floating-Point  Speed  Limits 

As we noted in Chapter 3, we were unable to find conclusive answers to
questions about the theoretical speed limits of floating-point arithmetic
carried out both consistently with the IEEE standard and otherwise, though
what we did find suggests that adherence to the IEEE standard theoretically
causes a significant loss in speed for division and taking square roots.

We were also unable to find the extent to which parts of individual
floating-point operations could be done in parallel or how effectively many
different floating-point operations could be done in parallel, with the
exception of the information on pipelined floating-point arithmetic with highly
redundant mantissas given in Chapter 3. We found some information on
how these questions are addressed in Cray supercomputers [HX85], but were
unable to study it or other literature on parallel computing thoroughly.


8.4  Alternative  Representations 

We ran out of time on Task 5 before we learned enough about some of the
alternative representation systems considered in the literature, particularly
the two discussed below, to make meaningful evaluations of them.

Iterated interval arithmetic is apparently a technique for computing and
recomputing interval bounds on desired quantities until these bounds
become short enough to achieve a desired precision. This arithmetic is thus a
particular form of constructive-real arithmetic. We were unable to critically
examine the time and space costs of operations using this technique.

Aberth [Abe88] describes one method of producing progressively shorter
interval bounds with a version of range arithmetic in which the range is
limited to one digit in the position of the mantissa’s least significant digit
and the mantissa loses digits if the range becomes larger. His system simply
repeats all calculations to a higher precision (a larger number of mantissa
digits) until the final results contain a desired number of accurate digits.

The ACRITH package from IBM [BRR85,KM83b] seems to use a more
sophisticated algorithm for refining interval approximations. It apparently
represents a real as an initial value plus a potentially infinite series of
progressively smaller correction terms. It apparently produces interval bounds
on results by truncating these series of correction terms, and computes
appropriate correction terms for the results of operations by forming symbolic
products of series and evaluating the initial terms of these series to obtain
bounds consistent with the current level of precision. We gathered these
impressions of the ACRITH package from Kahan’s [KL85] critique of it, a
critique that emphasized its sometimes excessive uses of time and space.

The  variable-length  p-adic  representation  by  Horspool  and  Hehner  [HH78] 
might  be  a  significant  extension  of  the  finite  p-adic  representation  described 
in  Section  5.7,  but  we  suspect  that  it  also  suffers  from  the  defect  of  not 
providing  a  natural  means  for  discarding  information. 


8.5  Experiments  on  Boehm’s  Package 

As we noted in Chapter 6, we were unable to test the time and memory-use
performances of Boehm’s constructive-real package in finding the
determinant of a large matrix. As we also noted in Chapter 6, however, we did
identify means for doing so, and did obtain an expert opinion from Boehm
on the issue the determinant experiment was intended to address. We also
obtained the Russell matrix arithmetic package and one of the new versions
of the Russell compiler that contains C-callable versions of the
constructive-real arithmetic routines, so we can perform the determinant experiment in
the future if asked to do so.


8.6 Aircraft Interception Example


We intended to examine code computing an interception path for aircraft
[VS86] for situations where highly-parallel computation or intermediate
results computed by a constructive-real package might be useful. We
abandoned work on this example because we could not determine the physics
assumed by the algorithm, and because the Reals project was asked to look
at a hostile booster trajectory estimation algorithm [App87] instead.

We found one potential situation where on-demand precision, which is
related to constructive-real arithmetic, might be useful in the small fragment
of this code that we examined. In this situation, the code computes directions
by dividing vectors by their norms, even though these norms can be 0 or near
0. The resulting directions are actually not significant in this case, though,
so we considered the situation a programming error rather than a potential
application for constructive-real arithmetic.


8.7  Errata 

Chapter 3 of our interim report contains two technical errors. First, the real
number whose standard continued fraction is $[1, 1, 1, \ldots]$ is not $\sqrt{2}$, but the
golden ratio $(1 + \sqrt{5})/2$. The correct standard continued fraction for $\sqrt{2}$ is
$[1, 2, 2, \ldots]$.
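
The first correction can be checked with integer arithmetic alone. The C sketch below (the name convergent is hypothetical) computes the k-th convergent p/q of a continued fraction of the form [a0, a, a, a, ...] with the usual recurrence p_k = a p_{k-1} + p_{k-2}, q_k = a q_{k-1} + q_{k-2}. For [1, 1, 1, ...] the convergents satisfy p^2 - pq - q^2 = ±1, so p/q approaches the positive root of x^2 - x - 1 = 0, the golden ratio; for [1, 2, 2, ...] they satisfy p^2 - 2q^2 = ±1, so p/q approaches the square root of 2.

```c
/* k-th convergent p/q (k >= 0) of the continued fraction
   [a0, a, a, a, ...], whose partial quotients after the first
   are all equal to a. */
void convergent(long a0, long a, int k, long *p, long *q)
{
    long p_prev = 1, q_prev = 0;     /* p_{-1}, q_{-1} */
    long p_cur = a0, q_cur = 1;      /* p_0, q_0 */
    int i;

    for (i = 1; i <= k; ++i) {
        long p_next = a * p_cur + p_prev;
        long q_next = a * q_cur + q_prev;
        p_prev = p_cur;  q_prev = q_cur;
        p_cur = p_next;  q_cur = q_next;
    }
    *p = p_cur;
    *q = q_cur;
}
```

With this sketch, convergent(1, 1, 5, ...) yields 13/8, a ratio of consecutive Fibonacci numbers, and convergent(1, 2, 4, ...) yields 41/29, for which 41^2 - 2 · 29^2 = -1.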

Our interim report also makes the statement, “This property [the best
rational approximation property] has as a consequence that finite initial
portions of numbers’ continued fraction representations produce, on average,
better approximations to the numbers per amount of information stored than
do any other rational representations, including ordinary base-b notation.”
This statement was based on a misunderstanding of the “best rational
approximation” property of continued fractions. The results by Matula and
Kornerup [KM85] on the gap sizes between consecutive LCF values, given
in Subsection 5.2.2, can be construed as saying that the representation
efficiency of the LCF encoding of continued fractions is asymptotically the same
as that of the ordinary binary fixed-point representation.


Appendix A

IEEE  Interval  Arithmetic 


This appendix describes a particular version of interval arithmetic that can be
implemented on any machine meeting the IEEE standard for binary
floating-point arithmetic [IEE85]. Appendix B gives an implementation of this
arithmetic for Sun computers. The “arithmetic” includes operations such as
interval intersection that do not correspond to operations on real numbers but
are often used in interval algorithms, as in the Interval Newton’s Method
algorithm implemented by code in Appendix E.


A.1 Extended Real-Number Arithmetic

Define the set $R'$ by $R' = (R \setminus \{0\}) \cup \{+0, -0, +\infty, -\infty\}$, where $+0$, $-0$,
$+\infty$ and $-\infty$ are new symbols. The values $+0$ and $-0$ behave as positive and
negative infinitesimals, respectively, and the values $+\infty$ and $-\infty$ behave as
positive and negative infinity. These new values correspond to possible
values in IEEE floating-point arithmetic, which is described in Section A.2.
Extend the usual order $<$ on $R$ to a similar order $<'$ on $R'$ by
saying that for every positive real $x$, $-\infty <' -x <' -0 <' +0 <' x <' +\infty$.

Define the set $R''$ by $R'' = R' \cup \{\text{NaN}, +\text{NaN}, -\text{NaN}\}$. The NaN value
corresponds to the “not a number” value used in IEEE floating-point
arithmetic as the result of invalid, completely indeterminate operations. The
$+\text{NaN}$ and $-\text{NaN}$ values correspond to values that are known only to be
positive or negative, respectively. They do not correspond to values in IEEE
arithmetic, but will be used in Section A.3 to define operations
on intervals.

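Extend the operations in $\{+, -, \cdot, /\}$ on $R$ to $R''$ consistently with the
theory of limits. In particular,

\begin{align*}
-(\pm\infty) &= \mp\infty \\
1/(\pm 0) &= \pm\infty \\
1/(\pm\infty) &= \pm 0 \\
(\pm\infty) + (\mp\infty) &= \text{NaN} \\
(\pm\infty) - (\pm\infty) &= \text{NaN} \\
+0 \cdot +\infty &= +\text{NaN} \\
+0 \cdot -\infty &= -\text{NaN} \\
-0 \cdot +\infty &= -\text{NaN} \\
-0 \cdot -\infty &= +\text{NaN} \\
\pm 0 / +0 &= \pm\text{NaN} \\
\pm 0 / -0 &= \mp\text{NaN} \\
\pm\infty / +\infty &= \pm\text{NaN} \\
\pm\infty / -\infty &= \mp\text{NaN}
\end{align*}
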

Let any operation with NaN as one of its arguments have NaN as its result.
Let an operation with $+\text{NaN}$ or $-\text{NaN}$ as one of its arguments have the
result consistent with only that argument’s sign being known. In particular,
$-(\pm\text{NaN}) = \mp\text{NaN}$, $(+\text{NaN}) + (+\text{NaN}) = +\text{NaN}$, and $(+\text{NaN}) + (-\text{NaN}) = \text{NaN}$.

Do not extend the order $<'$ to $R''$. In IEEE floating-point arithmetic,
any comparison involving NaN yields the result “unordered”; the
comparison operations do not even identify NaN as being equal to itself.
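
Several of the IEEE behaviors that this section mirrors can be observed directly in C. The sketch below is illustrative only: the function names are hypothetical, and it assumes a compiler and machine that follow IEEE arithmetic for double and are not run in a "fast math" mode. It covers only plain NaN, since the signed values +NaN and -NaN defined above have no IEEE counterpart (in IEEE arithmetic, for instance, the product of a zero and an infinity is an ordinary NaN).

```c
#include <float.h>

/* volatile keeps the compiler from evaluating these expressions
   at compile time */
static volatile double zero = 0.0;

double pos_inf(void)  { return 1.0 / zero; }             /* 1/(+0) = +infinity */
double neg_inf(void)  { return 1.0 / -zero; }            /* 1/(-0) = -infinity */
double make_nan(void) { return pos_inf() - pos_inf(); }  /* (+inf) - (+inf) = NaN */

/* An infinity is the only value beyond the largest finite double. */
int is_infinite(double x) { return x > DBL_MAX || x < -DBL_MAX; }

/* False only for NaN, which does not compare equal even to itself. */
int equals_itself(double x) { return x == x; }
```

Under these assumptions, equals_itself(make_nan()) is 0 while equals_itself(pos_inf()) is 1, and 1.0 / pos_inf() compares equal to 0.0.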


A.2 IEEE Floating-Point Arithmetic

Following the IEEE standard, let $M_{m,n}$ for positive integers $m$ and $n$ be the
set of all real numbers of the form

$$\pm\,\mathit{significand} \cdot 2^{\mathit{exponent}}$$

where $0 < \mathit{significand} < 2$, $\mathit{significand}$ is an integral multiple of $2^{-n}$, and
$-m < \mathit{exponent} < m$. The values of $m$ and $n$ vary on different machines.
Assume that $m$ and $n$ are fixed, and let $M = M_{m,n} \cup \{+0, -0, +\infty, -\infty\}$.
$M$ denotes the machine-representable “numerical” values on a machine
supporting IEEE-standard floating-point arithmetic.

Let a rounding be a function $\Box : R' \to M$ satisfying

$$\forall x \in M\ (\Box x = x) \quad\text{and}\quad \forall x, y \in R'\ (x \le y \Rightarrow \Box x \le \Box y).$$

Note that these properties have as a consequence that $\Box x = x$ or $\Box x$ is one
of the two representable values closest to $x$.

Say that a rounding $\Box$ is upward if $\Box x \ge x$, and downward if $\Box x \le x$.
The downward rounding of zero is $-0$ and the upward rounding of zero is $+0$.
The IEEE standard also defines two other types of roundings, toward zero
and to-nearest, and calls for the maintenance of a rounding mode that selects
one of these four roundings as the current rounding. The programmer can
change the rounding mode at will. For $* \in \{+, -, \cdot, /\}$, the standard calls for
the corresponding machine operation $*_M$ to satisfy

$$x *_M y = \Box(x * y)$$

for all machine-representable reals $x$ and $y$, where $\Box$ is the current rounding.
This note will not require the full generality of the IEEE standard, but
will use upward and downward roundings in defining the interval arithmetic
computations. The asymptotic version of the interval semantics also depends
on properties of these roundings.


A.3  Machine  Interval  Arithmetic 

For values $a_1$ and $a_2$ in $R'$, with $a_1 \le' a_2$, call a subset of $R'$ of the form

$$A = [a_1, a_2] = \{t \in R' \mid a_1 \le' t \le' a_2\}$$

an interval. Note that $+0$ and $-0$ are distinct as possible endpoints. Let
the special value EMPTY, denoting the empty set, be an interval, and let
POSINF and NEGINF denote the intervals $[+\infty, +\infty]$ and $[-\infty, -\infty]$,
respectively. POSINF can be thought of as an infinite interval of positive
reals whose magnitudes are all too large to be machine-representable, and
NEGINF as a similar interval of negative numbers.

The set of possible intervals contains intervals that exactly correspond to
each of the possible values representable in IEEE arithmetic, as well as to
the $+\text{NaN}$ and $-\text{NaN}$ values introduced in Section A.1. The
real number $x$ corresponds to the point-interval $[x, x]$; the values $+\infty$ and
$-\infty$ correspond to the intervals POSINF and NEGINF, respectively; the
values $+\text{NaN}$ and $-\text{NaN}$ correspond to the intervals $[+0, +\infty]$ and $[-\infty, -0]$,
respectively; and NaN corresponds to the interval $[-\infty, +\infty]$.

For $* \in \{+, -, \cdot, /\}$ and intervals $A$ and $B$, define the interval operation
$A * B$ as follows. Consider the intermediate set VALUES defined by

$$\text{VALUES} = \{a * b \mid a \in A,\ b \in B\},$$

where the operations are performed in $R''$. Define new intermediate sets
KNOWN and UNKNOWN by

$$\text{KNOWN} = \text{VALUES} \setminus \{+\text{NaN}, -\text{NaN}, \text{NaN}\}$$

and

$$\text{UNKNOWN} = \text{VALUES} \cap \{+\text{NaN}, -\text{NaN}, \text{NaN}\}.$$

Replace the values $+\text{NaN}$, $-\text{NaN}$ and NaN in UNKNOWN by the intervals
$[+0, +\infty]$, $[-\infty, -0]$ and $[-\infty, +\infty]$, respectively, and let UNCERTAIN be
the union of the members of the resulting set. Finally, let $A * B$ be the union
of KNOWN and UNCERTAIN.
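
For the simplest case the definition reduces to a familiar formula. The C sketch below is an illustration under stated assumptions, not the implementation in Appendix B: it treats only multiplication of nonempty intervals with finite endpoints, where no NaN variants arise, so VALUES is the interval spanned by the four endpoint products. It ignores rounding and the distinction between +0 and -0. The struct layout and the name finite_product are assumptions; only the field names l and r follow the usage in Appendix B.

```c
struct interval { double l, r; };   /* left and right endpoints */

/* Product of two nonempty intervals with finite endpoints: the
   smallest and largest of the four endpoint products bound every
   product a*b with a in A and b in B.  No outward rounding is done. */
struct interval finite_product(struct interval a, struct interval b)
{
    double c[4];
    struct interval result;
    int i;

    c[0] = a.l * b.l;  c[1] = a.l * b.r;
    c[2] = a.r * b.l;  c[3] = a.r * b.r;

    result.l = result.r = c[0];
    for (i = 1; i < 4; ++i) {
        if (c[i] < result.l) result.l = c[i];
        if (c[i] > result.r) result.r = c[i];
    }
    return result;
}
```

For example, [-1, 2] times [3, 4] gives [-4, 8] with this sketch. The cases involving zeros and infinities are where the KNOWN and UNCERTAIN sets above do real work.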

Now define the machine-representable intervals and the operations on
them. Let $M$ be the set of machine-representable “numerical” values for
a particular machine. Let $\uparrow$ ($\downarrow$) be an upward (downward) rounding that
rounds to values in $M$. Define a conservative rounding on intervals by

$$\updownarrow A = \updownarrow [a_1, a_2] = [\downarrow a_1, \uparrow a_2].$$

Finally, define the machine interval-arithmetic operations $*_M$ for $* \in \{+, -, \cdot, /\}$
by

$$A *_M B = \updownarrow (A * B).$$

The rest of this appendix will only consider machine-representable
intervals, so take “interval” as an abbreviation for “machine-representable
interval” from now on. Note that all operations are defined for all possible pairs
of intervals. The ability to compute interval operations is dependent on the
availability of appropriate upward and downward rounding functions; such
functions are available in hardware for machines meeting the IEEE
floating-point standard.


A.4  Semantics  of  Interval  Operations 

Assume  that  the  machine  on  which  the  interval  operations  are  performed 
maintains  a  finite  collection  of  flags  called  the  status,  where  each  flag  in  the 
status  has  value  SET  or  CLEAR.  Following  the  IEEE  standard,  this  appendix 
considers  inexact,  invalid-operation,  divide-by-zero,  overflow  and 
underflow as possible status flags. Assume programs can test and set the
state  of  the  status  flags  by  making  appropriate  system  calls,  and  assume, 
following  the  IEEE  standard,  that  all  status  flags  are  initially  CLEAR  but 
that  once  a  status  flag  becomes  set  it  stays  set  until  the  program  makes  a 
system  call  to  reset  it  to  CLEAR.  Say  that  an  operation  causes  an  exception 
if  it  causes  the  status  flag  for  a  condition  to  be  set  if  that  flag  is  not  already 
set.  Let  each  status  flag  name  the  exception  consisting  of  causing  that  flag  to 
become  SET.  Note  that  a  single  operation  can  cause  more  than  one  exception. 

Further, assume that the constants $+0$, $-0$, 1, EMPTY, POSINF and
NEGINF, denoting the intervals $[+0, +0]$, $[-0, -0]$, $[1, 1]$, $[1, 0]$, $[+\infty, +\infty]$
and $[-\infty, -\infty]$, respectively, are available to programs. (The subroutine
doinits in Appendix B creates the necessary constants without causing any
exceptions in the process; these constants will presumably be more readily
available after extensions to C are chosen to take advantage of items available
in IEEE arithmetic.) Also assume the binary functions $+$, $\cdot$, $-$, $/$, $\cup$ and $\cap$,
and the unary functions length, left-end, right-end and mid-point are
available.

For intervals $Q$ and $R$, the operation $Q * R$, for $* \in \{+, \cdot, -, /\}$, is as
defined  above.  The  operation  causes  the  following  exceptions  in  the  given 
conditions;  otherwise  the  transition  does  not  cause  any  exceptions: 

1. An inexact exception occurs whenever the machine interval produced
as the result of the operation is not the exact result for the operation.

2. An overflow or underflow exception can occur whenever an inexact
exception  occurs.  Whether  or  not  one  of  these  exceptions  occurs  on 
a  particular  operation  with  particular  arguments  will  vary  with  the 
machine,  the  operation,  the  arguments,  and  the  details  of  exactly  how 
the  interval  operations  are  implemented. 

3. A divide-by-zero exception occurs whenever the interval being
divided by contains either $+0$ or $-0$.

4. An invalid-operation exception occurs in each of these cases: one
or both of the arguments to an operation is EMPTY; the operation is
the sum of two intervals and one has $+\infty$ as its upper bound and the
other has $-\infty$ as its lower bound; the operation is the difference of two
intervals and both intervals have $+\infty$ ($-\infty$) as the same bound (i.e.,
both upper or both lower); the operation is interval multiplication, one
of the intervals is not finite (i.e., it has $+\infty$ or $-\infty$ as a bound), and
the other contains $+0$ or $-0$; the operation is interval division, and
either both intervals contain some zero (i.e., $+0$ or $-0$) or both have at
least one infinite bound.

The $\cup$ of two intervals is the interval from the least point to the greatest
point (under the order $<'$) in the union of the intervals. The $\cap$ of two intervals
is the (possibly EMPTY) intersection of the intervals. Neither operation
causes any exceptions.
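
These two operations are short enough to sketch in C. The sketch is an illustration under stated assumptions: the names hull, intersect, is_empty and EMPTY_IV are hypothetical; EMPTY is represented by the interval [1, 0], so that an interval is empty exactly when its left end exceeds its right end; the hull of EMPTY with another interval is taken to be that other interval; and ordinary double comparisons are used, so the ordering of -0 before +0 under $<'$ is not captured.

```c
struct interval { double l, r; };   /* left and right endpoints */

/* EMPTY is represented as [1,0]: any interval with l > r is empty. */
static const struct interval EMPTY_IV = { 1.0, 0.0 };

int is_empty(struct interval a) { return a.l > a.r; }

/* Interval from the least to the greatest point of the union. */
struct interval hull(struct interval a, struct interval b)
{
    struct interval h;

    if (is_empty(a)) return b;
    if (is_empty(b)) return a;
    h.l = a.l < b.l ? a.l : b.l;
    h.r = a.r > b.r ? a.r : b.r;
    return h;
}

/* Intersection of two intervals, EMPTY if they do not overlap. */
struct interval intersect(struct interval a, struct interval b)
{
    struct interval m;

    if (is_empty(a) || is_empty(b)) return EMPTY_IV;
    m.l = a.l > b.l ? a.l : b.l;
    m.r = a.r < b.r ? a.r : b.r;
    return is_empty(m) ? EMPTY_IV : m;
}
```

Neither routine performs any arithmetic on the endpoints, which is consistent with the statement that these operations cause no exceptions.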

The  length  of  an  interval  is  an  interval  containing  the  interval’s  length.  If 
the  interval  is  of  infinite  length,  which  it  will  be  if  it  is  POSINF  or  NEGINF, 
then  its  length  is  POSINF.  The  length  of  EMPTY  is  taken  to  be  0.  If  the 
value  of  a  length  operation  is  not  a  point  interval  or  POSINF  then  the 
operation  causes  an  inexact  exception;  otherwise  it  causes  no  exceptions. 

The left-end (right-end) of a nonempty interval is the point interval
containing the possibly-infinite least (greatest) point in the interval. The
left and right ends of EMPTY are taken to be POSINF and NEGINF,
respectively, the left and right ends of POSINF are both taken to be POSINF,
and the left and right ends of NEGINF are both taken to be NEGINF. The
left-end operation on an EMPTY or POSINF interval, and the right-end
operation on an EMPTY or NEGINF interval, cause an invalid-operation
exception; otherwise these operations cause no exceptions.


Appendix B

IEEE  Interval  Operations 


This appendix contains a copy of the file included as intops.c in the interval
algorithm programs given in Appendices C and E. It implements the interval
operations specified in Appendix A. It also contains the function litprint
for showing the binary values of IEEE double-precision floating-point values
as they are implemented on Sun machines, and the function doinits for
creating infinite constants without causing exceptions.

This code is written to run under Release 3.5 of the Sun UNIX 4.2
operating system. It makes the changes in the rounding mode necessary to give
optimally-rounded intervals with calls to the fpmode_ system call, which has
been replaced with a different system call in more recent releases of the Sun
operating system.

The code follows.

/* LITMAX should be as large as the largest number of */
/* calls to litprint in a single invocation of printf */

#define LITMAX 10
#define LITLNGTH 20

char *litprint(x)
double x;
{
    static char litstrings[LITMAX][LITLNGTH];
    static int litindex = 0;    /* compiler init needed */
    char *pstring,*sprintf();
    unsigned short *px;

    px = (unsigned short *) &x;
    pstring = litstrings[litindex];

    (void) sprintf(pstring,"%04x %04x %04x %04x",
                   *px,*(px+1),*(px+2),*(px+3));

    ++litindex;
    if(litindex == LITMAX)
        litindex = 0;

    return(pstring);
}


void doinits()
{
    static unsigned INFPART = 0x7ff00000;
    unsigned *punsign;
    double temp;

    punsign = (unsigned *) &temp;
    *punsign = INFPART;
    *(punsign+1) = 0;
    PLUINF = temp;
    MININF = -temp;
    POSINF.l = PLUINF;
    POSINF.r = PLUINF;
    NEGINF.l = MININF;
    NEGINF.r = MININF;
}


struct interval intsum(int1,int2)
struct interval int1,int2;
{
    struct interval result;

    if(int1.l > int1.r || int2.l > int2.r) {  /* EMPTY argument */
        result.l = PLUINF + MININF;           /* Cause exception */
        result = EMPTY;                       /* EMPTY result */
    }
    else {
        newmode = 2*64 + 2*16;
        oldmode = fpmode_(&newmode);
        result.l = int1.l + int2.l;
        newmode = 2*64 + 3*16;
        newmode = fpmode_(&newmode);
        result.r = int1.r + int2.r;
        newmode = fpmode_(&oldmode);
    }
    return(result);
}


struct interval intdiff(int1,int2)
struct interval int1,int2;
{
    struct interval result;

    if(int1.l > int1.r || int2.l > int2.r) {  /* EMPTY argument */
        result.l = PLUINF + MININF;           /* Cause exception */
        result = EMPTY;                       /* EMPTY result */
    }
    else {
        newmode = 2*64 + 2*16;
        oldmode = fpmode_(&newmode);
        result.l = int1.l - int2.r;
        newmode = 2*64 + 3*16;
        newmode = fpmode_(&newmode);
        result.r = int1.r - int2.l;
        newmode = fpmode_(&oldmode);
    }
    return(result);
}


struct interval intprod(int1,int2)
struct interval int1,int2;
{
    int i,j,isinf();
    double temp1,temp2,temp3,temp4,low,high;
    struct interval result;
    unsigned sign1,sign2,*punsign;

    if(int1.l > int1.r || int2.l > int2.r) {  /* EMPTY argument */
        result.l = PLUINF + MININF;           /* Cause exception */
        result = EMPTY;                       /* EMPTY result */
    }
    else {
        if(int1.l==0.0||
           int1.r==0.0||
           int2.l==0.0||
           int2.r==0.0||
           isinf(int1.l)||
           isinf(int1.r)||
           isinf(int2.l)||
           isinf(int2.r)
          ) {
            /* "Special calculations needed" case */
            low = PLUINF;
            high = MININF;
            for(i=0; i<2; ++i) {
                temp1 = (i==0)?int1.l:int1.r;
                for(j=0; j<2; ++j) {
                    temp2 = (j==0)?int2.l:int2.r;
                    if((temp1 == 0.0 && isinf(temp2)) ||
                       (temp2 == 0.0 && isinf(temp1))
                      ) {
                        /* NaN case */
                        punsign = (unsigned *) &temp1;
                        sign1 = *punsign >> 31;
                        punsign = (unsigned *) &temp2;
                        sign2 = *punsign >> 31;
                        if((sign1 && sign2)||
                           (!sign1 && !sign2)
                          ) {
                            if(low > 0.0)
                                low = 0.0;
                            high = PLUINF;
                        }
                        else {
                            if(high < 0.0) {
                                high = 0.0;
                                high *= -1;
                            }
                            low = MININF;
                        }
                    }
                    else {
                        /* Normal case */
                        newmode = 2*64 + 2*16;
                        oldmode = fpmode_(&newmode);
                        temp3 = temp1 * temp2;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        temp4 = temp1 * temp2;
                        if(temp3 < low)
                            low = temp3;
                        else {
                            if(temp3==0.0 && low==0.0) {
                                punsign = (unsigned *) &temp3;
                                sign1 = *punsign >> 31;
                                punsign = (unsigned *) &low;
                                sign2 = *punsign >> 31;
                                if(sign1 && !sign2)
                                    low = temp3;
                            }
                        }
                        if(temp4 > high)
                            high = temp4;
                        else {
                            if(temp4==0.0 && high==0.0) {
                                punsign = (unsigned *) &temp4;
                                sign1 = *punsign >> 31;
                                punsign = (unsigned *) &high;
                                sign2 = *punsign >> 31;
                                if(!sign1 && sign2)
                                    high = temp4;
                            }
                        }
                    }
                }
            }
            result.l = low;
            result.r = high;

            /* cause exception if called for */
            if((int1.l <= 0.0 &&
                int1.r >= 0.0 &&
                (isinf(int2.l) || isinf(int2.r))
               ) ||
               (int2.l <= 0.0 &&
                int2.r >= 0.0 &&
                (isinf(int1.l) || isinf(int1.r))
               )
              )
                temp3 = PLUINF + MININF;
        }
        else {
            /* "Old-fashioned code sufficient" case */
            newmode = 2*64 + 2*16;
            oldmode = fpmode_(&newmode);
            if(int1.l > 0.0) {
                if(int2.l > 0.0) {
                    result.l = int1.l * int2.l;
                    newmode = 2*64 + 3*16;
                    newmode = fpmode_(&newmode);
                    result.r = int1.r * int2.r;
                }
                else {
                    if(int2.r < 0.0) {
                        result.l = int1.r * int2.l;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.l * int2.r;
                    }
                    else {
                        result.l = int1.r * int2.l;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.r * int2.r;
                    }
                }
            }
            else {
                if(int1.r < 0.0) {
                    if(int2.l > 0.0) {
                        result.l = int1.l * int2.r;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.r * int2.l;
                    }
                    else {
                        if(int2.r < 0.0) {
                            result.l = int1.r * int2.r;
                            newmode = 2*64 + 3*16;
                            newmode = fpmode_(&newmode);
                            result.r = int1.l * int2.l;
                        }
                        else {
                            result.l = int1.l * int2.r;
                            newmode = 2*64 + 3*16;
                            newmode = fpmode_(&newmode);
                            result.r = int1.l * int2.l;
                        }
                    }
                }
                else {
                    if(int2.l > 0.0) {
                        result.l = int1.l * int2.r;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.r * int2.r;
                    }
                    else {
                        if(int2.r < 0.0) {
                            result.l = int1.r * int2.l;
                            newmode = 2*64 + 3*16;
                            newmode = fpmode_(&newmode);
                            result.r = int1.l * int2.l;
                        }
                        else {
                            temp1 = int1.l * int2.r;
                            temp2 = int1.r * int2.l;
                            result.l = (temp1 <= temp2)?temp1:temp2;
                            newmode = 2*64 + 3*16;
                            newmode = fpmode_(&newmode);
                            temp1 = int1.l * int2.l;
                            temp2 = int1.r * int2.r;
                            result.r = (temp1 >= temp2)?temp1:temp2;
                        }
                    }
                }
            }
            newmode = fpmode_(&oldmode);
        }
    }
    return(result);
}


struct interval intquot(int1,int2)
struct interval int1,int2;
{
    int i,j,isinf();
    double temp1,temp2,temp3,temp4,low,high;
    struct interval result;
    unsigned sign1,sign2,*punsign;

    if(int1.l > int1.r || int2.l > int2.r) {  /* EMPTY argument */
        result.l = PLUINF + MININF;           /* Cause exception */
        result = EMPTY;                       /* EMPTY result */
    }
    else {
        if(int1.l==0.0||
           int1.r==0.0||
           int2.l==0.0||
           int2.r==0.0||
           isinf(int1.l)||
           isinf(int1.r)||
           isinf(int2.l)||
           isinf(int2.r)||
           (int2.l < 0.0 && int2.r > 0.0)
          ) {
            /* "Special calculations needed" case */
            if(int2.l < 0.0 && int2.r > 0.0) {
                result.l = MININF;
                result.r = PLUINF;
                temp3 = 0.0;
                temp4 = 1.0/temp3;   /* give 0-divide exception */
            }
            else {
                low = PLUINF;
                high = MININF;
                for(i=0; i<2; ++i) {
                    temp1 = (i==0)?int1.l:int1.r;
                    for(j=0; j<2; ++j) {
                        temp2 = (j==0)?int2.l:int2.r;
                        if((temp1 == 0.0 && temp2 == 0.0) ||
                           (isinf(temp1) && isinf(temp2))
                          ) {
                            /* NaN case */
                            punsign = (unsigned *) &temp1;
                            sign1 = *punsign >> 31;
                            punsign = (unsigned *) &temp2;
                            sign2 = *punsign >> 31;
                            if((sign1 && sign2)||
                               (!sign1 && !sign2)
                              ) {
                                if(low > 0.0)
                                    low = 0.0;
                                high = PLUINF;
                            }
                            else {
                                if(high < 0.0) {
                                    high = 0.0;
                                    high *= -1;
                                }
                                low = MININF;
                            }
                            /* cause 0-divide if needed */
                            if(temp2 == 0.0) {
                                temp3 = 0.0;
                                temp4 = 1.0/temp3;
                            }
                        }
                        else {
                            /* Normal case */
                            newmode = 2*64 + 2*16;
                            oldmode = fpmode_(&newmode);
                            temp3 = temp1 / temp2;
                            newmode = 2*64 + 3*16;
                            newmode = fpmode_(&newmode);
                            temp4 = temp1 / temp2;
                            if(temp3 < low)
                                low = temp3;
                            else {
                                if(temp3==0.0 && low==0.0) {
                                    punsign = (unsigned *) &temp3;
                                    sign1 = *punsign >> 31;
                                    punsign = (unsigned *) &low;
                                    sign2 = *punsign >> 31;
                                    if(sign1 && !sign2)
                                        low = temp3;
                                }
                            }
                            if(temp4 > high)
                                high = temp4;
                            else {
                                if(temp4==0.0 && high==0.0) {
                                    punsign = (unsigned *) &temp4;
                                    sign1 = *punsign >> 31;
                                    punsign = (unsigned *) &high;
                                    sign2 = *punsign >> 31;
                                    if(!sign1 && sign2)
                                        high = temp4;
                                }
                            }
                        }
                    }
                }
                result.l = low;
                result.r = high;
            }

            /* cause invalid-op exception if needed */
            if((int1.l <= 0.0 &&
                int1.r >= 0.0 &&
                int2.l <= 0.0 &&
                int2.r >= 0.0
               ) ||
               ((isinf(int1.l) || isinf(int1.r)) &&
                (isinf(int2.l) || isinf(int2.r))
               )
              )
                temp3 = PLUINF + MININF;
        }
        else {
            /* "Old-fashioned code sufficient" case */
            newmode = 2*64 + 2*16;
            oldmode = fpmode_(&newmode);
            if(int1.l > 0.0) {
                if(int2.l > 0.0) {
                    result.l = int1.l / int2.r;
                    newmode = 2*64 + 3*16;
                    newmode = fpmode_(&newmode);
                    result.r = int1.r / int2.l;
                }
                else {
                    result.l = int1.r / int2.r;
                    newmode = 2*64 + 3*16;
                    newmode = fpmode_(&newmode);
                    result.r = int1.l / int2.l;
                }
            }
            else {
                if(int1.r < 0.0) {
                    if(int2.l > 0.0) {
                        result.l = int1.l / int2.l;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.r / int2.r;
                    }
                    else {
                        result.l = int1.r / int2.l;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.l / int2.r;
                    }
                }
                else {
                    if(int2.l > 0.0) {
                        result.l = int1.l / int2.l;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.r / int2.l;
                    }
                    else {
                        result.l = int1.r / int2.r;
                        newmode = 2*64 + 3*16;
                        newmode = fpmode_(&newmode);
                        result.r = int1.l / int2.r;
                    }
                }
            }
            newmode = fpmode_(&oldmode);
        }
    }
    return(result);
}


struct interval intsqrt(intx)
struct interval intx;
{
    struct interval result;

    if(intx.l > intx.r || intx.r < 0.0) {
        intx.l = PLUINF + MININF;   /* generate exception */
        result = EMPTY;
    }
    else {
        newmode = 2*64 + 2*16;      /* round lower end down */
        oldmode = fpmode_(&newmode);
        if(intx.l < 0.0) {
            result.l = 0.0;
            result.l *= -1.0;
        }
        else
            result.l = sqrt(intx.l);
        newmode = 2*64 + 3*16;      /* round upper end up */
        newmode = fpmode_(&newmode);
        result.r = sqrt(intx.r);
        newmode = fpmode_(&oldmode);
    }
    return(result);
}


struct interval intunion(x,y)
struct interval x,y;
{
    struct interval result;

    if(x.l > x.r)
        result = y;
    else {
        if(y.l > y.r)
            result = x;
        else {
            result.l = (x.l <= y.l)?x.l:y.l;
            result.r = (x.r >= y.r)?x.r:y.r;
        }
    }
    return(result);
}


struct interval intinter(x,y)
struct interval x,y;
{
    struct interval result;

    result.l = (x.l >= y.l)?x.l:y.l;
    result.r = (x.r <= y.r)?x.r:y.r;
    if(result.l > result.r)
        result = EMPTY;
    return(result);
}


struct interval intlength(intx)
struct interval intx;
{
    struct interval result;

    if(intx.l >= intx.r) {
        result.l = 0.0;
        result.r = 0.0;
    }
    else {
        newmode = 2*64 + 2*16;
        oldmode = fpmode_(&newmode);
        result.l = intx.r - intx.l;
        newmode = 2*64 + 3*16;
        newmode = fpmode_(&newmode);
        result.r = intx.r - intx.l;
        newmode = fpmode_(&oldmode);
    }
    return(result);
}


struct interval leftend(intx)
struct interval intx;
{
    struct interval result;

    if(intx.l > intx.r) {
        intx.l = PLUINF + MININF;   /* generate exception */
        result = POSINF;
    }
    else {
        if(intx.l == PLUINF)
            intx.l = PLUINF + MININF;   /* generate exception */
        result.l = intx.l;
        result.r = intx.l;
    }
    return(result);
}


struct interval rightend(intx)
struct interval intx;
{
    struct interval result;

    if(intx.l > intx.r) {
        intx.l = PLUINF + MININF;   /* generate exception */
        result = NEGINF;
    }
    else {
        if(intx.r == MININF)
            intx.l = PLUINF + MININF;   /* generate exception */
        result.l = intx.r;
        result.r = intx.r;
    }
    return(result);
}
Appendix C


FFT  Multiplication  Programs 


The following C programs implement versions of an algorithm that uses Fast
Fourier Transforms to multiply large integers. The first two give interval
versions of this algorithm, and the last one gives a scalar version. One interval
program is for a Sun and the other is for a VAX. The scalar program is for
a Sun but can be easily modified, by following comments in the code, to
behave similarly on a VAX. If these programs are run with reasonably small
integers as inputs, the exact results of the Fast Fourier Transform algorithm
can be calculated with machine integer multiplication. In that case, the
lengths of the intervals the programs produce are greater than zero, or the
floating-point answers they produce differ from the exact results, only
because of error accumulated in doing the calculations.
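
The check described above can be pictured with a direct digit convolution
standing in for the transform. The sketch below is an illustration only:
digit_product is a hypothetical helper, it works on 32-bit operands split
into four base-256 digits (the digit size the programs print), and it uses
exact integer arithmetic throughout rather than any floating-point transform.

```c
#include <stdint.h>

#define NDIGITS 4   /* base-256 digits in a 32-bit operand */

/* Hypothetical helper: compute the low NDIGITS base-256 digits of a*b by
   convolving the digit sequences directly and then propagating carries.
   An FFT-based multiply computes the same convolution, but in floating
   point, which is where rounding error enters. */
void digit_product(uint32_t a, uint32_t b, unsigned out[NDIGITS])
{
    unsigned da[NDIGITS], db[NDIGITS];
    uint32_t conv[NDIGITS] = {0};
    uint32_t carry = 0;
    int i, j;

    for (i = 0; i < NDIGITS; ++i) {        /* split into digits */
        da[i] = (a >> (8 * i)) & 0xffu;
        db[i] = (b >> (8 * i)) & 0xffu;
    }
    for (i = 0; i < NDIGITS; ++i)          /* convolution, low terms only */
        for (j = 0; i + j < NDIGITS; ++j)
            conv[i + j] += (uint32_t)da[i] * db[j];
    for (i = 0; i < NDIGITS; ++i) {        /* normalize digits */
        uint32_t t = conv[i] + carry;
        out[i] = t & 0xffu;
        carry = t >> 8;
    }
}
```

The digits produced this way agree with the bytes of the machine product
a*b taken modulo 2^32, which is the reference value the programs compare
their interval or floating-point digits against.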

As explained in Chapter 2, these programs and their outputs show that,
at least for moderately-complicated calculations, having additional bits in the
mantissas of the floating-point values being calculated has a greater influence
on precision than having optimally-rounded results does. They also
illustrate that interval algorithms obtained by simply reinterpreting the values of
variables as intervals and the arithmetic operations on these variables as
corresponding interval operations tend to produce results whose error bounds
are much larger than the actual errors in the corresponding floating-point
calculations.
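
One well-known source of such overestimation, offered here as background
rather than as a finding of this report, is that interval operations treat
every operand as independent. The sketch below uses a hypothetical
naive_diff (no outward rounding, for simplicity) and a hypothetical width
helper to subtract an interval from itself.

```c
struct interval { double l, r; };

/* Hypothetical helper: interval subtraction by the usual endpoint rule,
   without directed rounding, so the example stays exact. */
struct interval naive_diff(struct interval a, struct interval b)
{
    struct interval d;

    d.l = a.l - b.r;   /* smallest possible difference */
    d.r = a.r - b.l;   /* largest possible difference */
    return d;
}

/* Hypothetical helper: length of an interval. */
double width(struct interval a)
{
    return a.r - a.l;
}
```

With x = [1, 2], naive_diff(x, x) returns [-1, 1], an interval of width 2,
even though x - x is exactly 0 for every real x in [1, 2]. A calculation
that reuses the same quantities many times, as a transform does, can
accumulate this kind of slack at each step.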
C.1  IEEE Interval FFT Multiply


The first program uses IEEE double-precision floating-point arithmetic as
implemented on a Sun 3/60 having an MC68881 coprocessor with mask A93N.
Its interval operations force results to be rounded correctly by calling fpmode_
under Release 3.5 of the Sun UNIX 4.2 operating system. Its interval
operations, obtained by including the file intops.c, are given in Appendix B;
their semantics is specified in Appendix A.

#include <stdio.h>
#include <math.h>

unsigned oldmode,newmode,fpmode_();
double PLUINF,MININF;

struct interval {double l,r;};
struct interval EMPTY = {1.0, 0.0};
struct interval ZERO = {0.0, 0.0};
struct interval ONE = {1.0, 1.0};
struct interval POSINF,NEGINF;

#define KKNUTH 8
#define LKNUTH 8
#define BIGKKNUTH 256

int reverse[BIGKKNUTH];

struct interval TWO = {2.0,2.0};
struct interval INT256 = {256.0,256.0};

struct complex {struct interval x,y;} wpow[BIGKKNUTH],
                                      w2pow[KKNUTH+1];


main()
{
    void doinits();
    char *litprint();
    unsigned char *pd;
    unsigned num0,num1,product;
    int i,j,k,rk;
    double power2;
    struct interval temp;
    struct interval intsum(),intdiff(),
                    intprod(),intquot(),intsqrt();
    struct complex s,t,f1[BIGKKNUTH],f2[BIGKKNUTH];

    doinits();   /* form infinite constants causing no exceptions */
    for(i=0; i < BIGKKNUTH; ++i) {
        k = i;
        rk = 0;
        for(j=1; j < KKNUTH; ++j) {
            if(k&1)
                rk += 1;
            k >>= 1;
            rk <<= 1;
        }
        if(k&1)
            rk += 1;
        reverse[i] = rk;
    }

    w2pow[1].x.l = -1.0;
    w2pow[1].x.r = -1.0;
    w2pow[1].y.l = 0.0;
    w2pow[1].y.r = 0.0;
    w2pow[2].x.l = 0.0;
    w2pow[2].x.r = 0.0;
    w2pow[2].y.l = 1.0;
    w2pow[2].y.r = 1.0;
    for(i=3; i <= KKNUTH; ++i) {
        w2pow[i].x = intsqrt(
                       intquot(
                         intsum(ONE,w2pow[i-1].x),
                         TWO
                       )
                     );
        w2pow[i].y = intsqrt(
                       intquot(
                         intdiff(ONE,w2pow[i-1].x),
                         TWO
                       )
                     );
    }
    w2pow[0] = w2pow[KKNUTH];

    for(i=0; i < BIGKKNUTH; ++i) {
        s.x = ONE;
        s.y = ZERO;
        j = i;
        k = 0;
        while(j != 0) {
            if(j&1) {
                t.x = intdiff(
                        intprod(s.x,w2pow[KKNUTH-k].x),
                        intprod(s.y,w2pow[KKNUTH-k].y)
                      );
                t.y = intsum(
                        intprod(s.x,w2pow[KKNUTH-k].y),
                        intprod(s.y,w2pow[KKNUTH-k].x)
                      );
                s = t;
            }
            j >>= 1;
            ++k;
        }
        wpow[i] = s;
    }

    (void) fprintf(stderr,"input first positive number: ");
    (void) scanf("%d",&num0);
    (void) fprintf(stderr,"input second positive number: ");
    (void) scanf("%d",&num1);

    power2 = (double) (1<<(KKNUTH + LKNUTH));   /* exact */
    for(i=0,pd = ((unsigned char *) &num0) + 3;
        i < 4;
        ++i,--pd) {
        f1[i].x.l = ((double) *pd)/power2;      /* exact */
        f1[i].x.r = f1[i].x.l;
        f1[i].y.l = 0.0;
        f1[i].y.r = 0.0;
    }
    for(i=4; i < BIGKKNUTH; ++i) {
        f1[i].x.l = 0.0;
        f1[i].x.r = 0.0;
        f1[i].y.l = 0.0;
        f1[i].y.r = 0.0;
    }
    for(i=0,pd = ((unsigned char *) &num1) + 3;
        i < 4;
        ++i,--pd) {
        f2[i].x.l = ((double) *pd)/power2;      /* exact */
        f2[i].x.r = f2[i].x.l;
        f2[i].y.l = 0.0;
        f2[i].y.r = 0.0;
    }
    for(i=4; i < BIGKKNUTH; ++i) {
        f2[i].x.l = 0.0;
        f2[i].x.r = 0.0;
        f2[i].y.l = 0.0;
        f2[i].y.r = 0.0;
    }

    fft(f1);
    fft(f2);
    for(i=0; i < BIGKKNUTH; ++i) {
        temp = f1[i].x;
        f1[i].x = intdiff(
                    intprod(temp,f2[i].x),
                    intprod(f1[i].y,f2[i].y)
                  );
        f1[i].y = intsum(
                    intprod(temp,f2[i].y),
                    intprod(f1[i].y,f2[i].x)
                  );
    }
    fft(f1);

    f2[0].x.l = (f1[0].x.l)/((double) BIGKKNUTH);   /* exact */
    f2[0].x.r = (f1[0].x.r)/((double) BIGKKNUTH);   /* exact */
    for(i=1; i < BIGKKNUTH; ++i) {
        f2[i].x.l = (f1[BIGKKNUTH-i].x.l)/((double) BIGKKNUTH);
        f2[i].x.r = (f1[BIGKKNUTH-i].x.r)/((double) BIGKKNUTH);
    }

    power2 *= power2;               /* exact */
    for(i=0; i < BIGKKNUTH; ++i) {
        f2[i].x.l *= power2;        /* exact */
        f2[i].x.r *= power2;        /* exact */
    }

    for(i=0; i < BIGKKNUTH - 1; ++i) {   /* normalize digits */
        while(f2[i].x.r >= 256.0) {
            f2[i+1].x = intsum(f2[i+1].x,ONE);
            f2[i].x = intdiff(f2[i].x,INT256);
        }
    }

    (void) printf("\
\n For the input values %d and %d\n\n",
                  num0,num1);
    product = num0 * num1;
    for(i=3,pd = (unsigned char *) &product;
        i >= 0;
        --i,++pd) {
        (void) printf("\
Actual base-256 digit of product: %d\n\
[%20.16e %20.16e]\n[(%s) (%s)]\n",
                      ((int) *pd),
                      f2[i].x.l,f2[i].x.r,
                      litprint(f2[i].x.l),litprint(f2[i].x.r));
    }
}


fft(pa)
struct complex pa[];
{
    int i,j0,j1,k,index0,index1,pow20,pow21;
    struct complex e0,e1,u,v;

    for(i=1; i <= KKNUTH; ++i) {
        pow20 = 1<<i;
        pow21 = 1<<(KKNUTH-i);
        for(j0=0; j0 < pow20; j0+=2) {
            j1 = j0+1;
            e0 = wpow[reverse[j0]];
            e1 = wpow[reverse[j1]];
            for(k=0; k < pow21; ++k) {
                index0 = j0*pow21+k;
                index1 = index0 + pow21;
                u = pa[index0];
                v = pa[index1];
                pa[index0].x = intsum(
                                 u.x,
                                 intdiff(
                                   intprod(e0.x,v.x),
                                   intprod(e0.y,v.y)
                                 )
                               );
                pa[index0].y = intsum(
                                 u.y,
                                 intsum(
                                   intprod(e0.x,v.y),
                                   intprod(e0.y,v.x)
                                 )
                               );
                pa[index1].x = intsum(
                                 u.x,
                                 intdiff(
                                   intprod(e1.x,v.x),
                                   intprod(e1.y,v.y)
                                 )
                               );
                pa[index1].y = intsum(
                                 u.y,
                                 intsum(
                                   intprod(e1.x,v.y),
                                   intprod(e1.y,v.x)
                                 )
                               );
            }
        }
    }

    for(i=0; i < BIGKKNUTH; ++i) {
        k = reverse[i];
        if(i > k) {
            u = pa[i];
            pa[i] = pa[k];
            pa[k] = u;
        }
    }
}

#include "intops.c"
C.2  VAX Interval FFT Multiply


The  second  program  uses  VAX  double-precision  floating-point  arithmetic  as 
implemented  on  a  VAX  11/750.  Since  VAX  arithmetic  does  not  support 
different  rounding  modes,  its  interval  operations  produce  upper  and  lower 
bounds  on  computed  quantities  by  adding  or  subtracting  amounts  equal  to 
the  least  significant  mantissa  bits  of  these  quantities. 
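
The same idea can be sketched for IEEE doubles by stepping a computed value
to its neighbouring representable number. This is an assumption-laden
illustration, not the VAX code: step_up, step_down, and intsum_widened are
hypothetical names, a 64-bit IEEE double is assumed, and only finite inputs
are handled.

```c
#include <stdint.h>
#include <string.h>

struct interval { double l, r; };

/* Hypothetical helper: next representable double above a finite x,
   obtained by adjusting the integer image of its bit pattern. */
double step_up(double x)
{
    uint64_t u;

    if (x == 0.0) {
        u = 1;                         /* smallest positive subnormal */
    } else {
        memcpy(&u, &x, sizeof u);
        if (x > 0.0)
            ++u;                       /* magnitude grows */
        else
            --u;                       /* magnitude shrinks toward zero */
    }
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Hypothetical helper: next representable double below a finite x. */
double step_down(double x)
{
    return -step_up(-x);
}

/* Interval sum without rounding-mode control: compute each end in the
   default mode, then widen it outward by one unit in the last place. */
struct interval intsum_widened(struct interval a, struct interval b)
{
    struct interval s;

    s.l = step_down(a.l + b.l);
    s.r = step_up(a.r + b.r);
    return s;
}
```

Because every result is widened whether or not it was rounded, intervals
produced this way are valid enclosures but are generally wider than the
optimally-rounded ones obtained with directed rounding.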

#include  <stdio.h> 

#include  <math.h> 

#define  KKNUTH  8 
#define  LKNUTH  8 
#define  BIGKKNUTH  256 

int reverse[BIGKKNUTH];

struct interval {double l,r;};
struct interval ZERO = {0.0, 0.0};
struct interval ONE = {1.0, 1.0};
struct interval TWO = {2.0,2.0};
struct interval INT256 = {256.0,256.0};

struct complex {struct interval x,y;} wpow[BIGKKNUTH],
                                      w2pow[KKNUTH+1];


main()
{
    char *litprint();
    unsigned char *pd;
    unsigned num0,num1,product;
    int i,j,k,rk;
    double power2;
    struct interval temp;
    struct interval intsum(),intdiff(),
                    intprod(),intquot(),intsqrt();
    struct complex s,t,f1[BIGKKNUTH],f2[BIGKKNUTH];

    for(i=0; i < BIGKKNUTH; ++i) {
        k = i;
        rk = 0;
        for(j=1; j < KKNUTH; ++j) {
            if(k&1)
                rk += 1;
            k >>= 1;
            rk <<= 1;
        }
        if(k&1)
            rk += 1;
        reverse[i] = rk;
    }

    w2pow[1].x.l = -1.0;
    w2pow[1].x.r = -1.0;
    w2pow[1].y.l = 0.0;
    w2pow[1].y.r = 0.0;
    w2pow[2].x.l = 0.0;
    w2pow[2].x.r = 0.0;
    w2pow[2].y.l = 1.0;
    w2pow[2].y.r = 1.0;
    for(i=3; i <= KKNUTH; ++i) {
        w2pow[i].x = intsqrt(
                       intquot(
                         intsum(ONE,w2pow[i-1].x),
                         TWO
                       )
                     );
        w2pow[i].y = intsqrt(
                       intquot(
                         intdiff(ONE,w2pow[i-1].x),
                         TWO
                       )
                     );
    }
    w2pow[0] = w2pow[KKNUTH];

    for(i=0; i < BIGKKNUTH; ++i) {
        s.x = ONE;
        s.y = ZERO;
        j = i;
        k = 0;
        while(j != 0) {
            if(j&1) {
                t.x = intdiff(
                        intprod(s.x,w2pow[KKNUTH-k].x),
                        intprod(s.y,w2pow[KKNUTH-k].y)
                      );
                t.y = intsum(
                        intprod(s.x,w2pow[KKNUTH-k].y),
                        intprod(s.y,w2pow[KKNUTH-k].x)
                      );
                s = t;
            }
            j >>= 1;
            ++k;
        }
        wpow[i] = s;
    }

    (void) fprintf(stderr,"input first positive number: ");
    (void) scanf("%d",&num0);
    (void) fprintf(stderr,"input second positive number: ");
    (void) scanf("%d",&num1);

    power2 = (double) (1<<(KKNUTH + LKNUTH));   /* exact */
    for(i=0,pd = (unsigned char *) &num0;
        i < 4;
        ++i,++pd) {
        f1[i].x.l = ((double) *pd)/power2;      /* exact */
        f1[i].x.r = f1[i].x.l;
        f1[i].y.l = 0.0;
        f1[i].y.r = 0.0;
    }
    for(i=4; i < BIGKKNUTH; ++i) {
        f1[i].x.l = 0.0;
        f1[i].x.r = 0.0;
        f1[i].y.l = 0.0;
        f1[i].y.r = 0.0;
    }
    for(i=0,pd = (unsigned char *) &num1;
        i < 4;
        ++i,++pd) {
        f2[i].x.l = ((double) *pd)/power2;      /* exact */
        f2[i].x.r = f2[i].x.l;
        f2[i].y.l = 0.0;
        f2[i].y.r = 0.0;
    }
    for(i=4; i < BIGKKNUTH; ++i) {
        f2[i].x.l = 0.0;
        f2[i].x.r = 0.0;
        f2[i].y.l = 0.0;
        f2[i].y.r = 0.0;
    }

    fft(f1);
    fft(f2);
    for(i=0; i < BIGKKNUTH; ++i) {
        temp = f1[i].x;
        f1[i].x = intdiff(
                    intprod(temp,f2[i].x),
                    intprod(f1[i].y,f2[i].y)
                  );
        f1[i].y = intsum(
                    intprod(temp,f2[i].y),
                    intprod(f1[i].y,f2[i].x)
                  );
    }
    fft(f1);

    f2[0].x.l = (f1[0].x.l)/((double) BIGKKNUTH);   /* exact */
    f2[0].x.r = (f1[0].x.r)/((double) BIGKKNUTH);   /* exact */
    for(i=1; i < BIGKKNUTH; ++i) {
        f2[i].x.l = (f1[BIGKKNUTH-i].x.l)/((double) BIGKKNUTH);
        f2[i].x.r = (f1[BIGKKNUTH-i].x.r)/((double) BIGKKNUTH);
    }

    power2 *= power2;               /* exact */
    for(i=0; i < BIGKKNUTH; ++i) {
        f2[i].x.l *= power2;        /* exact */
        f2[i].x.r *= power2;        /* exact */
    }

    for(i=0; i < BIGKKNUTH - 1; ++i) {   /* normalize digits */
        while(f2[i].x.r >= 256.0) {
            f2[i+1].x = intsum(f2[i+1].x,ONE);
            f2[i].x = intdiff(f2[i].x,INT256);
        }
    }

    (void) printf("\
\n For the input values %d and %d\n\n",num0,num1);
    product = num0 * num1;
    for(i=3,pd = ((unsigned char *) &product) + 3;
        i >= 0;
        --i,--pd) {
        (void) printf("\
Actual base-256 digit of product: %d\n\
[%20.16e %20.16e]\n[(%s) (%s)]\n",
                      ((int) *pd),
                      f2[i].x.l,f2[i].x.r,
                      litprint(f2[i].x.l),litprint(f2[i].x.r));
    }
}
fft(pa)
struct complex pa[];
{
    int i,j0,j1,k,index0,index1,pow20,pow21;
    struct complex e0,e1,u,v;

    for(i=1; i <= KKNUTH; ++i) {
        pow20 = 1<<i;
        pow21 = 1<<(KKNUTH-i);
        for(j0=0; j0 < pow20; j0+=2) {
            j1 = j0+1;
            e0 = wpow[reverse[j0]];
            e1 = wpow[reverse[j1]];
            for(k=0; k < pow21; ++k) {
                index0 = j0*pow21+k;
                index1 = index0 + pow21;
                u = pa[index0];
                v = pa[index1];
                pa[index0].x = intsum(
                                 u.x,
                                 intdiff(
                                   intprod(e0.x,v.x),
                                   intprod(e0.y,v.y)
                                 )
                               );
                pa[index0].y = intsum(
                                 u.y,
                                 intsum(
                                   intprod(e0.x,v.y),
                                   intprod(e0.y,v.x)
                                 )
                               );
                pa[index1].x = intsum(
                                 u.x,
                                 intdiff(
                                   intprod(e1.x,v.x),
                                   intprod(e1.y,v.y)
                                 )
                               );
                pa[index1].y = intsum(
                                 u.y,
                                 intsum(
                                   intprod(e1.x,v.y),
                                   intprod(e1.y,v.x)
                                 )
                               );
            }
        }
    }

    for(i=0; i < BIGKKNUTH; ++i) {
        k = reverse[i];
        if(i > k) {
            u = pa[i];
            pa[i] = pa[k];
            pa[k] = u;
        }
    }
}


#define downsum(X,Y) decrement((X)+(Y))
#define upsum(X,Y) increment((X)+(Y))

struct interval intsum(int1,int2)
struct interval int1,int2;
{
    double decrement(),increment();
    struct interval intr;

    intr.l = downsum(int1.l,int2.l);
    intr.r = upsum(int1.r,int2.r);
    return(intr);
}


struct interval intdiff(int1,int2)
struct interval int1,int2;
{
    double decrement(),increment();
    struct interval intr;

    intr.l = downsum(int1.l,-int2.r);
    intr.r = upsum(int1.r,-int2.l);
    return(intr);
}


#define downprod(X,Y) decrement((X)*(Y))
#define upprod(X,Y) increment((X)*(Y))

struct interval intprod(int1,int2)
struct interval int1,int2;
{
    double temp1,temp2;
    double decrement(),increment();
    struct interval intr;

    if(int1.l >= 0.0) {
        if(int2.l >= 0.0) {
            intr.l = downprod(int1.l,int2.l);
            intr.r = upprod(int1.r,int2.r);
        }
        else {
            if(int2.r <= 0.0) {
                intr.l = downprod(int1.r,int2.l);
                intr.r = upprod(int1.l,int2.r);
            }
            else {
                intr.l = downprod(int1.r,int2.l);
                intr.r = upprod(int1.r,int2.r);
            }
        }
    }
    else {
        if(int1.r <= 0.0) {
            if(int2.l >= 0.0) {
                intr.l = downprod(int1.l,int2.r);
                intr.r = upprod(int1.r,int2.l);
            }
            else {
                if(int2.r <= 0.0) {
                    intr.l = downprod(int1.r,int2.r);
                    intr.r = upprod(int1.l,int2.l);
                }
                else {
                    intr.l = downprod(int1.l,int2.r);
                    intr.r = upprod(int1.l,int2.l);
                }
            }
        }
        else {
            if(int2.l >= 0.0) {
                intr.l = downprod(int1.l,int2.r);
                intr.r = upprod(int1.r,int2.r);
            }
            else {
                if(int2.r <= 0.0) {
                    intr.l = downprod(int1.r,int2.l);
                    intr.r = upprod(int1.l,int2.l);
                }
                else {
                    temp1 = downprod(int1.l,int2.r);
                    temp2 = downprod(int1.r,int2.l);
                    intr.l = (temp1 <= temp2)?temp1:temp2;
                    temp1 = upprod(int1.l,int2.l);
                    temp2 = upprod(int1.r,int2.r);
                    intr.r = (temp1 >= temp2)?temp1:temp2;
                }
            }
        }
    }
    return(intr);
}


#define downquot(X,Y) decrement((X)/(Y))
#define upquot(X,Y) increment((X)/(Y))

struct interval intquot(int1,int2)
struct interval int1,int2;
{
    double decrement(),increment();
    struct interval intr;

    if(int2.l <= 0.0 && int2.r >= 0.0) {
        (void) fprintf(stderr,"interval division by zero\n");
        exit(1);
    }

    if(int1.l >= 0.0) {
        if(int2.l > 0.0) {
            intr.l = downquot(int1.l,int2.r);
            intr.r = upquot(int1.r,int2.l);
        }
        else {
            intr.l = downquot(int1.r,int2.r);
            intr.r = upquot(int1.l,int2.l);
        }
    }
    else {
        if(int1.r <= 0.0) {
            if(int2.l > 0.0) {
                intr.l = downquot(int1.l,int2.l);
                intr.r = upquot(int1.r,int2.r);
            }
            else {
                intr.l = downquot(int1.r,int2.l);
                intr.r = upquot(int1.l,int2.r);
            }
        }
        else {
            if(int2.l > 0.0) {
                intr.l = downquot(int1.l,int2.l);
                intr.r = upquot(int1.r,int2.l);
            }
            else {
                intr.l = downquot(int1.r,int2.r);
                intr.r = upquot(int1.l,int2.r);
            }
        }
    }
    return(intr);
}


struct interval intsqrt(intx)
struct interval intx;
{
    double increment(),decrement();
    struct interval result;

    if(intx.l > intx.r || intx.r < 0.0) {
        (void) fprintf(stderr,"square root of negative quantity\n");
        exit(1);
    }
    else {
        if(intx.l < 0.0)
            result.l = 0.0;
        else
            result.l = sqrt(intx.l);
        result.r = sqrt(intx.r);
        if(result.l > 0.0)
            result.l = decrement(result.l);
        if(result.r > 0.0)
            result.r = increment(result.r);
    }
    return(result);
}


#define MAX 1.7014118346046923e+38
#define NEGMAX -1.7014118346046923e+38
#define MIN 2.9387358770557188e-39
#define NEGMIN -2.9387358770557188e-39


double increment(x)
double x;
{
    unsigned char *pexp10,*pexp11,*pexp20,*pexp21;
    double min1;

    if(x == MAX) {
        (void) fprintf(stderr,"\
attempt to increase largest value\n");
        exit(1);
    }
    if(x == 0.0)
        return(MIN);
    if(x == NEGMIN)
        return(0.0);
    min1 = 0.0;
    pexp10 = ((unsigned char *) &x);
    pexp11 = pexp10 + 1;
    pexp20 = ((unsigned char *) &min1);
    pexp21 = pexp20 + 1;
    *pexp20 = *pexp10 & 128;    /* modifies min1 -- remember, */
    *pexp21 = *pexp11 & 127;    /* leading mantissa 1 implicit */
    if(*pexp21 >= 28) {
        *pexp21 -= 28;          /* min1 is now half the value */
        x += 2.0 * min1;        /* of the least bit of x */
    }
    else {
        *pexp11 += 28;          /* scale x to avoid underflow */
        x += 2.0 * min1;        /* min1 again half least bit */
        *pexp11 -= 28;          /* scale x back */
    }
    return(x);
}


double decrement(x)
double x;
{
    unsigned char *pexp10,*pexp11,*pexp20,*pexp21;
    double min1;

    if(x == NEGMAX) {
        (void) fprintf(stderr,"\
attempt to decrease most negative value\n");
        exit(1);
    }
    if(x == 0.0)
        return(NEGMIN);
    if(x == MIN)
        return(0.0);
    min1 = 0.0;
    pexp10 = ((unsigned char *) &x);
    pexp11 = pexp10 + 1;
    pexp20 = ((unsigned char *) &min1);
    pexp21 = pexp20 + 1;
    *pexp20 = *pexp10 & 128;    /* modifies min1 -- remember, */
    *pexp21 = *pexp11 & 127;    /* leading mantissa 1 implicit */
    if(*pexp21 >= 28) {
        *pexp21 -= 28;          /* min1 is now half the value */
        x -= 2.0 * min1;        /* of the least bit of x */
    }
    else {
        *pexp11 += 28;          /* scale x to avoid underflow */
        x -= 2.0 * min1;        /* min1 again half least bit */
        *pexp11 -= 28;          /* scale x back */
    }
    return(x);
}


/* LITMAX should be as large as the largest number of */
/* calls to litprint in a single invocation of printf */

#define LITMAX 10
#define LITLNGTH 20

char *litprint(x)
double x;
{
static char litstrings[LITMAX][LITLNGTH];
static int litindex = 0;      /* compiler init needed */
char *pstring,*sprintf();
unsigned short *px;

px = (unsigned short *) &x;
pstring = litstrings[litindex];

(void) sprintf(pstring,"%04x %04x %04x %04x",
    *px,*(px+1),*(px+2),*(px+3));

++litindex;
if(litindex == LITMAX)
    litindex = 0;

return(pstring);
}


C.3  Scalar  FFT  Multiply 

The final program implements an ordinary floating-point version of an
algorithm that uses Fast Fourier Transforms to multiply large integers.
Although the program as given is written to run on a Sun, comments in the
code describe the simple adaptations needed to have it produce similar
results on the VAX.
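
As a reference point for reading the listing, the quantity the transforms compute is the convolution of the base-256 digit sequences of the two inputs, followed by carry propagation. The sketch below is not the FFT method: it forms the same convolution directly in integer arithmetic, so its digits are exact and can be compared with the floating-point digits reported in Appendix D. The name multiply_base256 and the constant NDIGITS are assumptions made for this illustration.

```c
#define NDIGITS 4   /* base-256 digits per 32-bit input (assumption of this sketch) */

/* Multiply two numbers given as base-256 digits, least significant digit
   first, by direct convolution followed by carry normalization. */
void multiply_base256(const unsigned char a[NDIGITS],
                      const unsigned char b[NDIGITS],
                      unsigned char out[2 * NDIGITS])
{
    unsigned long conv[2 * NDIGITS] = {0};
    unsigned long carry = 0;
    int i, j;

    /* conv[k] is the sum of a[i]*b[j] over all i + j == k */
    for (i = 0; i < NDIGITS; ++i)
        for (j = 0; j < NDIGITS; ++j)
            conv[i + j] += (unsigned long) a[i] * b[j];

    /* normalize: reduce each position modulo 256 and carry upward */
    for (i = 0; i < 2 * NDIGITS; ++i) {
        unsigned long v = conv[i] + carry;
        out[i] = (unsigned char) (v % 256);
        carry = v / 256;
    }
}
```

For the inputs 8724 (digits 20, 34) and 9683 (digits 211, 37), the low four output digits are 124, 250, 8 and 5, matching the "Actual base-256 digit of product" lines for that pair in Appendix D, which lists the most significant digit first.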

#include <stdio.h>
#include <math.h>

#define KKNUTH 8
#define LKNUTH 8
#define BIGKKNUTH 256

int reverse[BIGKKNUTH];

struct complex {double x,y;} wpow[BIGKKNUTH],
                             w2pow[KKNUTH+1];

main()
{
char *litprint();
unsigned char *pd;
unsigned num0,num1,product;
int i,j,k,rk;
double power2,temp;
struct complex s,t,f1[BIGKKNUTH],f2[BIGKKNUTH];

for(i=0; i < BIGKKNUTH; ++i) {
    k = i;
    rk = 0;
    for(j=1; j < KKNUTH; ++j) {
        if(k&1)
            rk += 1;
        k >>= 1;
        rk <<= 1;
    }
    if(k&1)
        rk += 1;
    reverse[i] = rk;
}

w2pow[1].x = -1.0;
w2pow[1].y = 0.0;
w2pow[2].x = 0.0;
w2pow[2].y = 1.0;
for(i=3; i <= KKNUTH; ++i) {
    w2pow[i].x = sqrt((1.0 + w2pow[i-1].x)/2.0);
    w2pow[i].y = sqrt((1.0 - w2pow[i-1].x)/2.0);
}
w2pow[0] = w2pow[KKNUTH];

for(i=0; i < BIGKKNUTH; ++i) {
    s.x = 1.0;
    s.y = 0.0;
    j = i;
    k = 0;
    while(j != 0) {
        if(j&1) {
            t.x = s.x * w2pow[KKNUTH-k].x -
                  s.y * w2pow[KKNUTH-k].y;
            t.y = s.x * w2pow[KKNUTH-k].y +
                  s.y * w2pow[KKNUTH-k].x;
            s = t;
        }
        j >>= 1;
        ++k;
    }
    wpow[i] = s;
}

(void) fprintf(stderr,"input first positive number: ");
(void) scanf("%d",&num0);
(void) fprintf(stderr,"input second positive number: ");
(void) scanf("%d",&num1);

power2 = (double) (1<<(KKNUTH + LKNUTH));

for(i=0,pd = ((unsigned char *) &num0) + 3;     /* See Note */
    i < 4;
    ++i,--pd) {                                 /* See Note */
    f1[i].x = ((double) *pd)/power2;
    f1[i].y = 0.0;
}
for(i=4; i < BIGKKNUTH; ++i) {
    f1[i].x = 0.0;
    f1[i].y = 0.0;
}

for(i=0,pd = ((unsigned char *) &num1) + 3;     /* See Note */
    i < 4;
    ++i,--pd) {                                 /* See Note */
    f2[i].x = ((double) *pd)/power2;
    f2[i].y = 0.0;
}
for(i=4; i < BIGKKNUTH; ++i) {
    f2[i].x = 0.0;
    f2[i].y = 0.0;
}

fft(f1);
fft(f2);

for(i=0; i < BIGKKNUTH; ++i) {
    temp = f1[i].x;
    f1[i].x = temp * f2[i].x - f1[i].y * f2[i].y;
    f1[i].y = temp * f2[i].y + f1[i].y * f2[i].x;
}

fft(f1);

f2[0].x = (f1[0].x)/((double) BIGKKNUTH);
for(i=1; i < BIGKKNUTH; ++i)
    f2[i].x = (f1[BIGKKNUTH-i].x)/((double) BIGKKNUTH);

power2 *= power2;
for(i=0; i < BIGKKNUTH; ++i)
    f2[i].x *= power2;

for(i=0; i < BIGKKNUTH - 1; ++i) {              /* normalize */
    while(f2[i].x >= 256.0) {
        f2[i+1].x += 1.0;
        f2[i].x -= 256.0;
    }
}

(void) printf("\
\n For the input values %d and %d\n\n",
    num0,num1);

product = num0 * num1;

for(i=3,pd = (unsigned char *) &product;        /* See Note */
    i >= 0;
    --i,++pd) {                                 /* See Note */
    (void) printf("\
Actual base-256 digit of product: %d\n\
%20.16e\n(%s)\n",
        ((int) *pd),f2[i].x,litprint(f2[i].x));
}
}

/* Note: On the VAX, the individual bytes of a floating
   point value are addressed differently. The three loops
   noted above should be changed to

   for(i=0,pd = ((unsigned char *) &num0);
       i < 4;
       ++i,++pd) {

   for(i=0,pd = ((unsigned char *) &num1);
       i < 4;
       ++i,++pd) {

   for(i=3,pd = (unsigned char *) &product + 3;
       i >= 0;
       --i,--pd) {

   to make the program behave the same on the VAX as it
   does on the Suns. */


fft(pa)
struct complex pa[];
{
int i,j0,j1,k,index0,index1,pow20,pow21;
struct complex e0,e1,u,v;

for(i=1; i <= KKNUTH; ++i) {
    pow20 = 1<<i;
    pow21 = 1<<(KKNUTH-i);
    for(j0=0; j0 < pow20; j0+=2) {
        j1 = j0+1;
        e0 = wpow[reverse[j0]];
        e1 = wpow[reverse[j1]];
        for(k=0; k < pow21; ++k) {
            index0 = j0*pow21+k;
            index1 = index0 + pow21;
            u = pa[index0];
            v = pa[index1];
            pa[index0].x = u.x + e0.x*v.x - e0.y*v.y;
            pa[index0].y = u.y + e0.x*v.y + e0.y*v.x;
            pa[index1].x = u.x + e1.x*v.x - e1.y*v.y;
            pa[index1].y = u.y + e1.x*v.y + e1.y*v.x;
        }
    }
}

for(i=0; i < BIGKKNUTH; ++i) {
    k = reverse[i];
    if(i > k) {
        u = pa[i];
        pa[i] = pa[k];
        pa[k] = u;
    }
}
}


/* LITMAX should be as large as the largest number of */
/* calls to litprint in a single invocation of printf */

#define LITMAX 10
#define LITLNGTH 20

char *litprint(x)
double x;
{
static char litstrings[LITMAX][LITLNGTH];
static int litindex = 0;      /* compiler init needed */
char *pstring,*sprintf();
unsigned short *px;

px = (unsigned short *) &x;
pstring = litstrings[litindex];

(void) sprintf(pstring,"%04x %04x %04x %04x",
    *px,*(px+1),*(px+2),*(px+3));

++litindex;
if(litindex == LITMAX)
    litindex = 0;

return(pstring);
}
Appendix D

FFT Multiply Comparisons


This appendix contains an edited collection of output from the interval and
scalar versions of the programs given in Appendix C, programs that use Fast
Fourier Transforms to multiply large integers. In all cases, results are first
given in decimal form, then as hexadecimal descriptions of the actual bit
patterns that code the double-precision interval endpoints or scalar values
on the machine being used. These results are discussed in Chapters 2 and 3.
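
To relate the hexadecimal lines to the decimal ones, an IEEE bit pattern can be turned back into a value as sketched below. The sketch assumes the four 16-bit groups are listed most significant first, as they are in the Sun output, and that the host stores double in IEEE binary64 format with the same byte order as its 64-bit integers; the name ieee_from_words is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Rebuild an IEEE double from four 16-bit groups, most significant first,
   e.g. the pattern printed as (4050 8000 0000 0000). */
double ieee_from_words(const uint16_t w[4])
{
    uint64_t bits = ((uint64_t) w[0] << 48) | ((uint64_t) w[1] << 32) |
                    ((uint64_t) w[2] << 16) |  (uint64_t) w[3];
    double d;

    memcpy(&d, &bits, sizeof d);   /* reinterpret the bits without aliasing problems */
    return d;
}
```

With this helper, the pattern (4050 8000 0000 0000) decodes to 66, and (4013 ffff ffff ffe2) decodes to a value slightly below 5.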

The Sun interval results were obtained using upwardly- or downwardly-rounded
IEEE double-precision floating-point values, as implemented on a
Sun 3/60 having an MC68881 coprocessor with mask A93N, as interval
endpoints. The VAX interval results were obtained by forming upper and lower
bounds on computed quantities by adding or subtracting amounts equal to
the least significant mantissa bits of these computed quantities. The VAX
interval computations were made on a VAX 11/750. The Sun scalar results
were obtained using the IEEE default, round-to-nearest rounding mode on
the same Sun 3/60 machine. The VAX scalar results were obtained using the
only rounding mode available to VAX arithmetic, which is similar to
round-to-nearest but rounds to the value with larger magnitude rather than the
value with least significant bit 0 when two representable values are equally
close to a nonrepresentable intermediate result. The VAX scalar computations
were made on the same VAX 11/750.
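
The VAX bit patterns in the output use a different layout from the IEEE ones. As a reading aid, the sketch below decodes a VAX double-precision pattern under the assumption that it is in D_floating format: a sign bit, an 8-bit excess-128 exponent, and a 55-bit stored fraction behind a hidden leading bit, with the mantissa in [0.5, 1). The name vax_d_from_words is hypothetical, and because the result is held in an IEEE double, the lowest three of the 56 mantissa bits may be rounded away.

```c
#include <math.h>
#include <stdint.h>

/* Decode four 16-bit groups of an assumed VAX D_floating pattern, in the
   order printed, e.g. (4384 0000 0000 0000). */
double vax_d_from_words(const uint16_t w[4])
{
    int sign = (w[0] >> 15) & 1;
    int exponent = (w[0] >> 7) & 0xff;
    double mantissa;

    if (exponent == 0)
        return 0.0;                 /* zero exponent field: treated as zero here */

    /* hidden bit has weight 1/2; the stored fraction bits follow it */
    mantissa = 0.5
             + ldexp((double) (w[0] & 0x7f), -8)
             + ldexp((double) w[1], -24)
             + ldexp((double) w[2], -40)
             + ldexp((double) w[3], -56);

    return ldexp(sign ? -mantissa : mantissa, exponent - 128);
}
```

Under these assumptions, (4384 0000 0000 0000) decodes to 66 and (4100 0000 0000 0000) to 2, in agreement with the decimal values printed beside those patterns.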

The  edited  output  begins  on  the  next  page. 
For the input values 17 and 34

Actual base-256 digit of product: 0

[0.0000000000000000e+00 0.0000000000000000e+00]  (IEEE)
[(8000 0000 0000 0000) (0000 0000 0000 0000)]
[-3.0804118409741014e-13 3.0804118409741014e-13]  (VAX)
[(abad 6969 c640 0b7b) (2bad 6969 c640 0b7b)]
0.0000000000000000e+00  (IEEE)
(0000 0000 0000 0000)
0.0000000000000000e+00  (VAX)
(0000 0000 0000 0000)

Actual base-256 digit of product: 0

[0.0000000000000000e+00 0.0000000000000000e+00]  (IEEE)
[(8000 0000 0000 0000) (0000 0000 0000 0000)]
[-2.6524986067235032e-13 2.6524986067235032e-13]  (VAX)
[(ab95 5288 973f f980) (2b95 5288 973f f980)]
0.0000000000000000e+00  (IEEE)
(0000 0000 0000 0000)
0.0000000000000000e+00  (VAX)
(0000 0000 0000 0000)

Actual base-256 digit of product: 2

[2.0000000000000000e+00 2.0000000000000000e+00]  (IEEE)
[(4000 0000 0000 0000) (4000 0000 0000 0000)]
[1.9999999999997548e+00 2.0000000000002452e+00]  (VAX)
[(40ff ffff ffff dd7f) (4100 0000 0000 1141)]
2.0000000000000000e+00  (IEEE)
(4000 0000 0000 0000)
2.0000000000000000e+00  (VAX)
(4100 0000 0000 0000)

Actual base-256 digit of product: 66

[6.6000000000000000e+01 6.6000000000000000e+01]  (IEEE)
[(4050 8000 0000 0000) (4050 8000 0000 0000)]
[6.5999999999999494e+01 6.6000000000000506e+01]  (VAX)
[(4383 ffff ffff fee3) (4384 0000 0000 011d)]
6.6000000000000000e+01  (IEEE)
(4050 8000 0000 0000)
6.6000000000000000e+01  (VAX)
(4384 0000 0000 0000)


For the input values 8724 and 9683

Actual base-256 digit of product: 5

[4.9999999999685540e+00 5.0000000000314415e+00]  (IEEE)
[(4013 ffff ffff 75b3) (4014 0000 0000 8a48)]
[4.9999999999822584e+00 5.0000000000177497e+00]  (VAX)
[(419f ffff fffd 8fc6) (41a0 0000 0002 7083)]
4.9999999999999734e+00  (IEEE)
(4013 ffff ffff ffe2)
5.0000000000000000e+00  (VAX)
(41a0 0000 0000 0000)

Actual base-256 digit of product: 8

[7.9999999999729425e+00 8.0000000000275122e+00]  (IEEE)
[(401f ffff ffff 8900) (4020 0000 0000 3c80)]
[7.9999999999841335e+00 8.0000000000161793e+00]  (VAX)
[(41ff ffff fffd d1bf) (4200 0000 0001 1ca1)]
8.0000000000002274e+00  (IEEE)
(4020 0000 0000 0080)
8.0000000000001705e+00  (VAX)
(4200 0000 0000 0300)

Actual base-256 digit of product: 250

[2.4999999999994088e+02 2.5000000000006276e+02]  (IEEE)
[(406f 3fff ffff f7e0) (406f 4000 0000 08a0)]
[2.4999999999996529e+02 2.5000000000003619e+02]  (VAX)
[(4479 ffff ffff d9d5) (447a 0000 0000 27cb)]
2.5000000000000000e+02  (IEEE)
(406f 4000 0000 0000)
2.5000000000000091e+02  (VAX)
(447a 0000 0000 0100)

Actual base-256 digit of product: 124

[1.2399999999998454e+02 1.2400000000001546e+02]  (IEEE)
[(405e ffff ffff fbc0) (405f 0000 0000 0440)]
[1.2399999999999007e+02 1.2400000000001038e+02]  (VAX)
[(43f7 ffff ffff ea2b) (43f8 0000 0000 16d5)]
1.2400000000000000e+02  (IEEE)
(405f 0000 0000 0000)
1.2400000000000011e+02  (VAX)
(43f8 0000 0000 0040)


For the input values 9473281 and 6734529

Actual base-256 digit of product: 38

[3.7999999999447027e+01 3.8000000000574801e+01]  (IEEE)
[(4042 ffff fffe d000) (4043 0000 0001 3c00)]
[3.7999999999444603e+01 3.8000000000564492e+01]  (VAX)
[(4317 ffff fff6 7557) (4318 0000 0009 b2a9)]
3.8000000000007276e+01  (IEEE)
(4043 0000 0000 0400)
3.8000000000006366e+01  (VAX)
(4318 0000 0000 1c00)

Actual base-256 digit of product: 59

[5.8999999999548891e+01 5.9000000000465661e+01]  (IEEE)
[(404d 7fff ffff 0800) (404d 8000 0001 0000)]
[5.8999999999541919e+01 5.9000000000467176e+01]  (VAX)
[(436b ffff fff8 2157) (436c 0000 0008 06a9)]
5.9000000000000000e+01  (IEEE)
(404d 8000 0000 0000)
5.9000000000006366e+01  (VAX)
(436c 0000 0000 1c00)

Actual base-256 digit of product: 15

[1.4999999999570719e+01 1.5000000000440195e+01]  (IEEE)
[(402d ffff fffc 5000) (402e 0000 0003 c800)]
[1.4999999999748830e+01 1.5000000000256172e+01]  (VAX)
[(426f ffff ffee bd5f) (4270 0000 0011 9aa1)]
1.5000000000003638e+01  (IEEE)
(402e 0000 0000 0800)
1.5000000000003183e+01  (VAX)
(4270 0000 0000 3800)

Actual base-256 digit of product: 193

[1.9299999999982782e+02 1.9300000000017249e+02]  (IEEE)
[(4068 1fff ffff e856) (4068 2000 0000 17b5)]
[1.9299999999990635e+02 1.9300000000009354e+02]  (VAX)
[(4440 ffff ffff 9907) (4441 0000 0000 66d8)]
1.9299999999999977e+02  (IEEE)
(4068 1fff ffff fff8)
1.9300000000000008e+02  (VAX)
(4441 0000 0000 0017)



Appendix E

Interval Newton's Code


The following code implements Alefeld and Herzberger's [Ale86] Interval
Newton's Method, discussed in Chapter 2, for IEEE double-precision
floating-point arithmetic as implemented on a Sun 3/60 having an MC68881
coprocessor with mask A93N. Its interval operations force results to be rounded
correctly by calling fpmode_ under Release 3.5 of the Sun UNIX 4.2 operating
system. Its interval operations, obtained from the file intops.c, are given
in Appendix B; their semantics is specified in Appendix A.

#include <stdio.h>
#include <math.h>

unsigned oldmode,newmode,fpmode_();

double PLUINF,MININF;

struct interval {double l,r;};
struct interval EMPTY = {1.0, 0.0};
struct interval TWO = {2.0, 2.0};
struct interval POSINF,NEGINF;

#define MAXDEG 10

int degree;
double coeff[MAXDEG+1];


void main()
{
void doinits();
char *litprint();
int i;
struct interval intlength(),midpoint(),intinter();
struct interval intdiff(),intquot();
struct interval intf();
struct interval start,intx,intm,middle,newlength,oldlength;

doinits();

(void) fprintf(stderr,"input degree <= %d of polynomial f: ",
    MAXDEG);
(void) scanf("%d",&degree);
for(i=degree; i >= 0; --i) {
    (void) fprintf(stderr,"input coefficient of x**%d: ",i);
    (void) scanf("%lf",&coeff[i]);
}

(void) fprintf(stderr,"input point x at which f(x) < 0: ");
(void) scanf("%lf",&start.l);
(void) fprintf(stderr,"input point x at which f(x) > 0: ");
(void) scanf("%lf",&start.r);
(void) fprintf(stderr,
"For the next two inputs, x ranges over the interval\n\
just input, and r is a root of f in this interval.\n");
(void) fprintf(stderr,
"input positive lower bound on f(x)/(x-r): ");
(void) scanf("%lf",&intm.l);
(void) fprintf(stderr,
"input positive upper bound on f(x)/(x-r): ");
(void) scanf("%lf",&intm.r);

intx = start;
newlength = intlength(intx);

do {
    oldlength = newlength;
    middle = midpoint(intx);
    intx = intinter(
               intx,
               intdiff(
                   middle,
                   intquot(
                       intf(middle),
                       intm
                   )
               )
           );
    newlength = intlength(intx);
} while(newlength.r < oldlength.l);

(void) printf("\n For the polynomial f(x) = \n");
for(i=degree; i >= 0; --i) {
    (void) printf(" %20.16e x^%d",coeff[i],i);
    if(i > 0)
        (void) printf(" +\n");
    else
        (void) printf(" ,\n");
}

(void) printf(" if the interval\n\
[%20.16e %20.16e]\n\
contains a root r, and for every x in the\n\
interval it is true that f(x)/(x-r) belongs\n\
to the interval\n\
[%20.16e %20.16e],\n\
then a root of f is contained in the interval:\n\n\
[%20.16e %20.16e], i.e.,\n\
[(%s) (%s)].\n\n",
    start.l,start.r,
    intm.l,intm.r,
    intx.l,intx.r,
    litprint(intx.l),litprint(intx.r));
}


struct interval intf(intx)
struct interval intx;
{
double low,high,f();
struct interval result;

newmode = 2*64 + 2*16;
oldmode = fpmode_(&newmode);
low = f(intx.l);
high = f(intx.r);
result.l = (low <= high)?low:high;
newmode = 2*64 + 3*16;
newmode = fpmode_(&newmode);
low = f(intx.l);
high = f(intx.r);
result.r = (high >= low)?high:low;
newmode = fpmode_(&oldmode);

return(result);
}

double f(x)
double x;
{
int i;
double y;

y = 0.0;
for(i=degree; i >= 0; --i) {
    y *= x;
    y += coeff[i];
}

return(y);
}

struct interval midpoint(intx)
struct interval intx;
{
struct interval leftend(),rightend(),
                intsum(),intquot();
struct interval result;

result = intquot(
             intsum(
                 leftend(intx),
                 rightend(intx)
             ),
             TWO
         );

return(result);
}


#include "intops.c"


Appendix F

Interval Newton's Results


This appendix contains output results from the program implementing the
interval version of Newton's Method developed by Alefeld and Herzberger
[Ale86] and given in Appendix E. All calculations were performed on a Sun
3/60 having an MC68881 coprocessor with mask A93N. All calculations were
performed in the IEEE default round-to-nearest rounding mode. These
results are discussed in Subsection 2.3.2.

By way of comparison, the arbitrary-precision constructive-real calculator
developed by Boehm [Boe87] gives the following:

$\sqrt{2} \approx 1.414213562373095048801688724$,

$\sqrt[3]{3} \approx 1.442249570307408382321638311$, and

$\sqrt[5]{5} \approx 1.379729661461214832390063464$.

 For the polynomial f(x) =
  1.0000000000000000e+00 x^2 +
  0.0000000000000000e+00 x^1 +
 -2.0000000000000000e+00 x^0 ,
 if the interval
[1.0000000000000000e+00 2.0000000000000000e+00]
contains a root r, and for every x in the
interval it is true that f(x)/(x-r) belongs
to the interval
[2.0000000000000000e+00 4.0000000000000000e+00],
then a root of f is contained in the interval:

[1.4142135623730949e+00 1.4142135623730951e+00], i.e.,
[(3ff6 a09e 667f 3bcc) (3ff6 a09e 667f 3bcd)].


 For the polynomial f(x) =
  1.0000000000000000e+00 x^3 +
  0.0000000000000000e+00 x^2 +
  0.0000000000000000e+00 x^1 +
 -3.0000000000000000e+00 x^0 ,
 if the interval
[1.0000000000000000e+00 2.0000000000000000e+00]
contains a root r, and for every x in the
interval it is true that f(x)/(x-r) belongs
to the interval
[3.0000000000000000e+00 1.2000000000000000e+01],
then a root of f is contained in the interval:

[1.4422495703074083e+00 1.4422495703074085e+00], i.e.,
[(3ff7 1374 4912 3ef6) (3ff7 1374 4912 3ef7)].


 For the polynomial f(x) =
  1.0000000000000000e+00 x^5 +
  0.0000000000000000e+00 x^4 +
  0.0000000000000000e+00 x^3 +
  0.0000000000000000e+00 x^2 +
  0.0000000000000000e+00 x^1 +
 -5.0000000000000000e+00 x^0 ,
 if the interval
[1.0000000000000000e+00 2.0000000000000000e+00]
contains a root r, and for every x in the
interval it is true that f(x)/(x-r) belongs
to the interval
[5.0000000000000000e+00 8.0000000000000000e+01],
then a root of f is contained in the interval:

[1.3797296614612147e+00 1.3797296614612149e+00], i.e.,
[(3ff6 135f 68d4 c0cb) (3ff6 135f 68d4 c0cc)].
Appendix G

A Correctness Difficulty


As we explain in Section 2.4, the following C program is a counterexample to
the spirit of our conjecture that programs using interval arithmetic that are
provably asymptotically correct are also effectively asymptotically correct.
The program computes $\pi$ using Machin's formula [BB87],

$$\frac{\pi}{4} = 4\arctan\left(\frac{1}{5}\right) - \arctan\left(\frac{1}{239}\right),$$

and the power series

$$\arctan x = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \cdots.$$


#include <stdio.h>

main()
{
double m,pow5,pow239;
double low,high,oldlow,oldhigh;

low = 0.0;
high = 16.0/5.0 - 4.0/239.0;
m = 1.0;
pow5 = 5.0;
pow239 = 239.0;
do {
    oldlow = low;
    m += 2.0;
    pow5 *= 25.0;
    pow239 *= 57121.0;
    low = high - 16.0/(pow5*m) + 4.0/(pow239*m);

    oldhigh = high;
    m += 2.0;
    pow5 *= 25.0;
    pow239 *= 57121.0;
    high = low + 16.0/(pow5*m) - 4.0/(pow239*m);
} while(low < high && low > oldlow && high < oldhigh);

(void) printf(
"\n Computed approximations to bounds on pi:\n\n\
lower bound -- %20.16e\n\
upper bound -- %20.16e\n\n",low,high);
}
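
The constants in the loop can be traced to Machin's formula and the arctangent power series. Multiplying Machin's formula by 4 and substituting the series gives the following, stated for exact arithmetic; the symbol $T_k$ is introduced here only for illustration and does not appear in the report.

```latex
\pi \;=\; 16\arctan\frac{1}{5} \;-\; 4\arctan\frac{1}{239}
    \;=\; \sum_{k=0}^{\infty} (-1)^k \, T_k ,
\qquad
T_k \;=\; \frac{16}{(2k+1)\,5^{2k+1}} \;-\; \frac{4}{(2k+1)\,239^{2k+1}} .
```

In the program, m plays the role of $2k+1$, and pow5 and pow239 are advanced by the factors $25 = 5^2$ and $57121 = 239^2$. Since the $T_k$ are positive and decreasing, successive partial sums of this alternating series lie alternately above and below $\pi$ in exact arithmetic, and these are the quantities that high and low approximate in floating point.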
Appendix H

Caliban Continued Fractions


The following Caliban code [BMS89], discussed in Subsection 6.4.1, computes
and displays standard and generalized continued fractions. If the user
simplifies the expression getcfrac n in the Clio prover [BMS89], where n is a
base-10 expression for a nonnegative integer no greater than 134217728, the
code produces a list of the fractions i/n for n ≤ i ≤ 2n and their standard
and optimal continued fraction expansions. The name arith refers to the
file arith.def, which contains definitions of standard constants, arithmetic
operations and order relations on natural numbers, integers and rationals.
The file arith.def is given below.

Comments in the code use "term" to mean "partial quotient". The cfrac
function computes standard continued fractions and the negfrac function
computes optimum ones. The name negfrac refers to the possible occurrence
of negative numbers as partial quotients in optimum continued fractions.
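
The difference between the two expansions can be sketched in C for ordinary machine integers. The sketch follows the rule used by nextterm in the listing: take the quotient truncated toward zero, and move one step further from zero only when that leaves a strictly smaller remainder in absolute value. The C function names reuse cfrac and negfrac for readability, but the signatures and the long-integer representation are assumptions of this sketch, not part of the Caliban code.

```c
#include <stdlib.h>

/* Standard expansion of p/q: each partial quotient is p/q truncated toward
   zero. Stores at most max terms and returns how many were stored. */
int cfrac(long p, long q, long *terms, int max)
{
    int n = 0;
    while (q != 0 && n < max) {
        long t = p / q;
        long r = p - t * q;
        terms[n++] = t;
        p = q;
        q = r;
    }
    return n;
}

/* Expansion using the nearer of the truncated quotient t and its neighbour
   further from zero; partial quotients may be negative. */
int negfrac(long p, long q, long *terms, int max)
{
    int n = 0;
    while (q != 0 && n < max) {
        long t = p / q;
        long tp = (t >= 0) ? t + 1 : t - 1;
        long r;
        if (labs(p - t * q) > labs(p - tp * q))
            t = tp;                    /* neighbour leaves a smaller remainder */
        r = p - t * q;
        terms[n++] = t;
        p = q;
        q = r;
    }
    return n;
}
```

For 5/3 the first function yields 1, 1, 2 and the second yields 2, -3, since 2 + 1/(-3) = 5/3; the second expansion is shorter at the cost of a negative partial quotient.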

FROM arith IMPORT izero,ione,ilesseq,iabs
FROM arith IMPORT iplus,idiff,imult,idiv

|| Operations on integers

nextterm p q = t, ilesseq (iabs (idiff p (imult t q)))
                          (iabs (idiff p (imult tp q))); tp
               where t = idiv p q
                     tp = iplus t ione, ilesseq izero t;
                          idiff t ione

cfrac <<p, <<ZERO,ZERO>> >> = []
cfrac <<p,q>> = t1 : (cfrac <<q, (idiff p (imult t1 q))>>)
                where t1 = idiv p q

negfrac <<p, <<ZERO,ZERO>> >> = []
negfrac <<p,q>> = t2 : (negfrac <<q, (idiff p (imult t2 q))>>)
                  where t2 = nextterm p q

|| Operations creating continued fractions. They use NUMs
|| to enumerate the basic possibilities.

numseq a b = [], b<a; a:(numseq (a+1) b)

numtoint a = <<(#a),ZERO>>

lpair [] a = []
lpair (a:l) b = <<a,b>>: (lpair l b)

lcfrac [] = []
lcfrac (<<a,b>>:l) = lcfrac l, cf1=cf2;
                     << <<a,b>>,("\n"),cf1,("\n"),cf2,("\n") >>
                     : lcfrac l
                     where
                     cf1 = (cfrac <<numtoint a, numtoint b>>)
                     cf2 = (negfrac <<numtoint a, numtoint b>>)

getcfrac n = [], n<1; lcfrac (lpair (numseq n (2*n)) n)


The  file  arith.def  follows.  Its  PROVE  statements  state  mathematical 
facts  that  allow  Clio  to  simplify  many  expressions. 

|| Edited version of ~howard/testdir/arith.def
|| Also includes code from ~mark/oracled/testdir/nat.def

|| Define the natural numbers.

nplus ZERO n = n
nplus (SUCC n) m = SUCC (nplus n m)

ndiff n ZERO = n
ndiff ZERO n = ZERO
ndiff (SUCC m) (SUCC n) = ndiff m n

nmult ZERO n = (!n)->ZERO;bottom
nmult (SUCC m) n = nplus (nmult m n) n

nless ZERO (SUCC m) = true
nless n ZERO = false
nless (SUCC n) (SUCC m) = nless n m

nlesseq n n = true
nlesseq n m = nless n m

ndiv m ZERO = bottom
ndiv m n = (nlesseq (SUCC m) n)->ZERO;
           SUCC (ndiv (ndiff m n) n)

|| Define the integers. Integers are pairs of natural
|| numbers: <<m,n>> is m-n. In all except intermediate
|| calculations, at least one of m and n is ZERO.

isimp <<ZERO,n>> = <<ZERO,n>>
isimp <<m,ZERO>> = <<m,ZERO>>
isimp <<(SUCC m),(SUCC n)>> = isimp <<m,n>>

izero = <<ZERO,ZERO>>
ione = <<(#1),ZERO>>

iplus <<i,j>> <<k,l>> = isimp <<nplus i k, nplus j l>>
idiff <<i,j>> <<k,l>> = isimp <<nplus i l, nplus j k>>
imult <<i,j>> <<k,l>> =
      isimp <<(nplus (nmult i k) (nmult j l)),
              (nplus (nmult i l) (nmult j k))>>

idiv <<i,j>> <<k,k>> = bottom
idiv <<i,ZERO>> <<k,ZERO>> = <<ndiv i k, ZERO>>
idiv <<ZERO,j>> <<ZERO,l>> = <<ndiv j l, ZERO>>
idiv <<i,ZERO>> <<ZERO,l>> = <<ZERO, ndiv i l>>
idiv <<ZERO,j>> <<k,ZERO>> = <<ZERO, ndiv j k>>
idiv x y = idiv (isimp x) (isimp y)

iabs <<i,ZERO>> = <<i,ZERO>>
iabs <<ZERO,j>> = <<j,ZERO>>

|| Define relations on integers

ilesseq <<i,j>> <<k,l>> = nlesseq (nplus i l) (nplus j k)

iter ZERO f s = s
iter (SUCC n) f s = iter n f (f s)

PROVE
`x=(SUCC x)`=`~(!x)`

PROVE
`nplus ZERO n` = `n`

PROVE
`nplus (SUCC n) m` = `SUCC (nplus n m)`

PROVE
`nplus m n` = `nplus n m`

PROVE
`nplus (nplus l m) n` = `nplus l (nplus m n)`

PROVE
`ndiff n ZERO` = `n`

PROVE
`ndiff (SUCC n) (SUCC m)` = `ndiff n m`

PROVE
`ndiff n n` = `ZERO`, `!n`=`true`

PROVE
`ndiff (ndiff l m) n` = `ndiff l (nplus m n)`

PROVE
`nless n ZERO` = `~!n`

PROVE
`nless (SUCC n) (SUCC m)` = `nless n m`

PROVE
`nless n (SUCC m)` = `true`,
    (`n=m`=`true` \/ `nless n m`=`true`)

PROVE
`nless (SUCC n) m` = `true`,
    (`nless n l` = `true` & `nless l m` = `true`)

PROVE
`nless ZERO (SUCC m)`=`true`

PROVE
`nless n ZERO` = `false`, `!n` = `true`

PROVE
`nless (SUCC n) (SUCC m)`=`nless n m`

PROVE
`ndiv m ZERO`=`bottom`

PROVE
`ndiv m n`=`(nlesseq (SUCC m) n)->ZERO;
            SUCC (ndiv (ndiff m n) n)`

PROVE
`nmult ZERO n`=`(!n)->ZERO;bottom`

PROVE
`nmult (SUCC m) n`=`nplus (nmult m n) n`
Appendix I

Gosper's Algorithm Code


This appendix contains C programs carrying out Gosper's algorithm to
compute combinations of the number $1 + \sqrt{2} = [2,2,2,\ldots]$ with itself, and an
example showing how subroutines of these programs can be rewritten to make
the programs compute combinations of other continued fractions. The first
program computes standard continued fractions and uses a decision cube
for deciding whether it is possible to determine and output the next
partial quotient of the result, as proposed by Matula and Kornerup in [KM88].
The second program computes generalized continued fractions and uses a
cruder algorithm to decide whether it can produce output. These programs
and their outputs are discussed in Subsection 6.4.2. Both programs use IEEE
double-precision floating-point arithmetic as implemented on a Sun 3/60
having an MC68881 coprocessor with mask A93N. They both clear and test the
inexact status flag by calling fpstatus_ under Release 3.5 of the Sun UNIX
4.2 operating system.


I.1 Standard Continued Fractions

#include <stdio.h>

double a,b,c,d,e,f,g,h,A,B,C,D,E,F,G,H;

main()
{
unsigned oldstatus,newstatus,fpstatus_();
double t,ta,tb,tc,td,te,tf,tg,th;
double x(),y(),floor();
double vx,vy,vz,p,q,pm1,qm1,pm2,qm2,rx(),ry(),rz();

newstatus = 0;      /* for clearing program status flags */

(void) fprintf(stderr,"initialize coefficient cube:\n");
(void) fprintf(stderr,"a: ");
(void) scanf("%lf",&a);
(void) fprintf(stderr,"b: ");
(void) scanf("%lf",&b);
(void) fprintf(stderr,"c: ");
(void) scanf("%lf",&c);
(void) fprintf(stderr,"d: ");
(void) scanf("%lf",&d);
(void) fprintf(stderr,"e: ");
(void) scanf("%lf",&e);
(void) fprintf(stderr,"f: ");
(void) scanf("%lf",&f);
(void) fprintf(stderr,"g: ");
(void) scanf("%lf",&g);
(void) fprintf(stderr,"h: ");
(void) scanf("%lf",&h);

vx = rx();          /* prepare values for descriptive output */
vy = ry();
vz = rz(vx,vy);

A = a;              /* initialize decision values */
B = a + b;
C = a + c;
D = a + b + c + d;
E = e;
F = e + f;
G = e + g;
H = e + f + g + h;

pm2 = 0;            /* initialize convergents */
qm2 = 1;
pm1 = 1;
qm1 = 0;

while(1) {
    if(H == 0 || F == 0 ||
       floor(B/F) != floor(D/H)) {      /* must ingest an x */
        t = x();
        oldstatus = fpstatus_(&newstatus);      /* clear flags */

        ta = a;     /* update coefficient values */
        tb = b;
        tc = c;
        td = d;
        te = e;
        tf = f;
        tg = g;
        th = h;

        a = t*ta + tc;
        b = t*tb + td;
        c = ta;
        d = tb;
        e = t*te + tg;
        f = t*tf + th;
        g = te;
        h = tf;

        ta = A;     /* update decision values */
        tb = B;
        tc = C;
        td = D;
        te = E;
        tf = F;
        tg = G;
        th = H;

        C = t*ta + tc;
        D = t*tb + td;
        A = C - ta;
        B = D - tb;
        G = t*te + tg;
        H = t*tf + th;
        E = G - te;
        F = H - tf;

        oldstatus = fpstatus_(&newstatus);      /* test flags */
        if(oldstatus & 512) {
            (void) printf("\n\
Inexact update ingesting x partial quotient %1.0f\n",
                t);
            exit(1);
        }
        (void) printf("\
Ingested the x partial quotient %1.0f\n",t);
    }

    else {
        if(G == 0 || E == 0 ||
           floor(C/G) != floor(D/H) ||
           floor(A/E) != floor(B/F)) {  /* must ingest a y */
            t = y();
            oldstatus = fpstatus_(&newstatus);

            ta = a;     /* update coefficient values */
            tb = b;
            tc = c;
            td = d;
            te = e;
            tf = f;
            tg = g;
            th = h;

            a = t*ta + tb;
            b = ta;
            c = t*tc + td;
            d = tc;
            e = t*te + tf;
            f = te;
            g = t*tg + th;
            h = tg;

            ta = A;     /* update decision values */
            tb = B;
            tc = C;
            td = D;
            te = E;
            tf = F;
            tg = G;
            th = H;

            B = t*ta + tb;
            A = B - ta;
            D = t*tc + td;
            C = D - tc;
            F = t*te + tf;
            E = F - te;
            H = t*tg + th;
            G = H - tg;

            oldstatus = fpstatus_(&newstatus);
            if(oldstatus & 512) {
                (void) printf("\n\
Inexact update ingesting y partial quotient %1.0f\n",
                    t);
                exit(1);
            }
            (void) printf("\
Ingested the y partial quotient %1.0f\n",t);
        }

        else {      /* can output a partial quotient */
            t = floor(A/E);
            oldstatus = fpstatus_(&newstatus);

            ta = a;     /* update coefficient values */
            tb = b;
            tc = c;
            td = d;
            te = e;
            tf = f;
            tg = g;
            th = h;

            a = te;
            b = tf;
            c = tg;
            d = th;
            e = ta - t*te;
            f = tb - t*tf;
            g = tc - t*tg;
            h = td - t*th;

            ta = A;     /* update decision values */
            tb = B;
            tc = C;
            td = D;
            te = E;
            tf = F;
            tg = G;
            th = H;

            A = te;
            B = tf;
            C = tg;
            D = th;
            E = ta - t*te;
            F = tb - t*tf;
            G = tc - t*tg;
            H = td - t*th;

            oldstatus = fpstatus_(&newstatus);
            if(oldstatus & 512) {
                (void) printf("\n\
Inexact update outputting partial quotient %1.0f\n",
                    t);
                exit(1);
            }
            (void) printf("\n\
Output the partial quotient %1.0f\n\n",
                t);
            (void) printf("\
Coefficient cube after output:\n\n\
a -- %-20.0f e -- %-20.0f\n\
b -- %-20.0f f -- %-20.0f\n\
c -- %-20.0f g -- %-20.0f\n\
d -- %-20.0f h -- %-20.0f\n\n",
                a,e,b,f,c,g,d,h);

            p = t*pm1 + pm2;    /* update convergents */
            q = t*qm1 + qm2;
            pm2 = pm1;
            qm2 = qm1;
            pm1 = p;
            qm1 = q;
            (void) printf("\
New approximation: %20.16e\n\
True result: %20.16e\n\n",
                p/q,vz);
        }
    }
}
}


double rz(u,v)
double u,v;
{                               /* Written to evaluate correctly with */
                                /* POSINF as one or both arguments */
    return((a + b/v + c/u + d/(u*v))/(e + f/v + g/u + h/(u*v)));
}


/* Code for inputs */

double x()
{
    return(2);
}

double rx()
{
    double sqrt();

    return(1.0 + sqrt(2.0));
}

double y()
{
    return(2);
}

double ry()
{
    double sqrt();

    return(1.0 + sqrt(2.0));
}


1.2 Generalized Continued Fractions

#include <stdio.h>

unsigned oldstatus,newstatus,fpstatus_();

double a,b,c,d,e,f,g,h;
double p,q,pml,qml,pm2,qm2;
double min,max,deltax,deltay;
double POSINF,value[3];

main()
{
    void ingestx(),ingesty(),outputz(),getlims();
    double vx,vy,vz,rx(),ry(),rz();

    newstatus = 0;              /* for clearing program status flags */

    vx = 0.0;                   /* for finding limit values */
    POSINF = 1/vx;
    value[0] = -1.0;
    value[1] = 1.0;
    value[2] = POSINF;

    (void) fprintf(stderr,"initialize coefficient cube:\n");
    (void) fprintf(stderr,"a: ");
    (void) scanf("%lf",&a);
    (void) fprintf(stderr,"b: ");
    (void) scanf("%lf",&b);
    (void) fprintf(stderr,"c: ");
    (void) scanf("%lf",&c);
    (void) fprintf(stderr,"d: ");
    (void) scanf("%lf",&d);
    (void) fprintf(stderr,"e: ");
    (void) scanf("%lf",&e);
    (void) fprintf(stderr,"f: ");
    (void) scanf("%lf",&f);
    (void) fprintf(stderr,"g: ");
    (void) scanf("%lf",&g);
    (void) fprintf(stderr,"h: ");
    (void) scanf("%lf",&h);

    vx = rx();                  /* prepare values for descriptive output */
    vy = ry();
    vz = rz(vx,vy);

    pm2 = 0;                    /* initialize convergents */
    qm2 = 1;
    pml = 1;
    qml = 0;

    ingestx();
    ingesty();

    while(1) {
        getlims();
        if (max - min < 0.5) {
            outputz();

            (void) printf("\
Coefficient cube after output:\n\n\
a -- %-20.0f e -- %-20.0f\n\
b -- %-20.0f f -- %-20.0f\n\
c -- %-20.0f g -- %-20.0f\n\
d -- %-20.0f h -- %-20.0f\n\n",
                          a,e,b,f,c,g,d,h);

            (void) printf("\
New approximation: %20.16e\n\
True result: %20.16e\n\n",
                          p/q,vz);
        }
        else {
            if (deltax >= deltay)
                ingestx();
            else
                ingesty();
        }
    }
}


void ingestx()
{
    double t,ta,tb,tc,td,te,tf,tg,th;
    double x();

    t = x();
    oldstatus = fpstatus_(&newstatus);      /* clear flags */

    ta = a;                     /* update coefficient values */
    tb = b;
    tc = c;
    td = d;
    te = e;
    tf = f;
    tg = g;
    th = h;

    a = t*ta + tc;
    b = t*tb + td;
    c = ta;
    d = tb;
    e = t*te + tg;
    f = t*tf + th;
    g = te;
    h = tf;

    oldstatus = fpstatus_(&newstatus);      /* test flags */
    if (oldstatus & 512) {
        (void) printf("\n\
Inexact update ingesting x partial quotient %1.0f\n", t);
        exit(1);
    }

    (void) printf("\
Ingested the x partial quotient %1.0f\n",t);
}

void ingesty()
{
    double t,ta,tb,tc,td,te,tf,tg,th;
    double y();

    t = y();
    oldstatus = fpstatus_(&newstatus);

    ta = a;                     /* update coefficient values */
    tb = b;
    tc = c;
    td = d;
    te = e;
    tf = f;
    tg = g;
    th = h;

    a = t*ta + tb;
    b = ta;
    c = t*tc + td;
    d = tc;
    e = t*te + tf;
    f = te;
    g = t*tg + th;
    h = tg;

    oldstatus = fpstatus_(&newstatus);
    if (oldstatus & 512) {
        (void) printf("\n\
Inexact update ingesting y partial quotient %1.0f\n", t);
        exit(1);
    }

    (void) printf("\
Ingested the y partial quotient %1.0f\n",t);
}


void outputz()
{
    double t,ta,tb,tc,td,te,tf,tg,th;
    double ceil(),floor();

    t = (min+max)/2.0;
    ta = floor(t);
    tb = ceil(t);
    if (t - ta < tb - t)
        t = ta;
    else
        t = tb;

    oldstatus = fpstatus_(&newstatus);

    ta = a;                     /* update coefficient values */
    tb = b;
    tc = c;
    td = d;
    te = e;
    tf = f;
    tg = g;
    th = h;

    a = te;
    b = tf;
    c = tg;
    d = th;
    e = ta - t*te;
    f = tb - t*tf;
    g = tc - t*tg;
    h = td - t*th;

    oldstatus = fpstatus_(&newstatus);
    if (oldstatus & 512) {
        (void) printf("\n\
Inexact update outputting partial quotient %1.0f\n", t);
        exit(1);
    }

    (void) printf("\n\
Output the partial quotient %1.0f\n\n", t);

    p = t*pml + pm2;            /* update convergents */
    q = t*qml + qm2;
    pm2 = pml;
    qm2 = qml;
    pml = p;
    qml = q;
}

void getlims()
{
    int i,j;
    double temp,limits[3][3],minx,maxx,miny,maxy,rz();

    min = POSINF;
    max = -POSINF;
    for(i=0; i<3; ++i) {
        for(j=0; j<3; ++j) {
            temp = rz(value[i],value[j]);
            limits[i][j] = temp;
            if (temp < min)
                min = temp;
            if (temp > max)
                max = temp;
        }
    }

    deltax = 0.0;
    for(j=0; j<3; ++j) {
        minx = POSINF;
        maxx = -POSINF;
        for(i=0; i<3; ++i) {
            temp = limits[i][j];
            if (temp < minx)
                minx = temp;
            if (temp > maxx)
                maxx = temp;
        }
        temp = maxx - minx;
        if (temp > deltax)
            deltax = temp;
    }

    deltay = 0.0;
    for(i=0; i<3; ++i) {
        miny = POSINF;
        maxy = -POSINF;
        for(j=0; j<3; ++j) {
            temp = limits[i][j];
            if (temp < miny)
                miny = temp;
            if (temp > maxy)
                maxy = temp;
        }
        temp = maxy - miny;
        if (temp > deltay)
            deltay = temp;
    }
}


double rz(u,v)
double u,v;
{                               /* Written to evaluate correctly with */
                                /* +- POSINF as one or both arguments */
    return((a + b/v + c/u + d/(u*v))/(e + f/v + g/u + h/(u*v)));
}


/* Code for inputs */

double x()
{
    return(2);
}

double rx()
{
    double sqrt();

    return(1.0 + sqrt(2.0));
}

double y()
{
    return(2);
}

double ry()
{
    double sqrt();

    return(1.0 + sqrt(2.0));
}

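The comment in rz notes that the expression is written to evaluate correctly when an argument is +- POSINF, which getlims depends on because value[2] is POSINF. The sketch below illustrates why the form matters. The names struct cube, rz_recip and rz_poly are hypothetical and introduced only here, and the polynomial form is shown purely for contrast. It assumes IEEE arithmetic, in which a finite value divided by infinity is zero and infinity divided by infinity is NaN.

```c
#include <math.h>   /* INFINITY */

/* Hypothetical container for the eight coefficients a..h. */
struct cube { double a, b, c, d, e, f, g, h; };

/* Form used in the listing: numerator and denominator divided by u*v. */
double rz_recip(const struct cube *k, double u, double v)
{
    return (k->a + k->b/v + k->c/u + k->d/(u*v)) /
           (k->e + k->f/v + k->g/u + k->h/(u*v));
}

/* Algebraically equal for finite nonzero u and v, but not at infinity. */
double rz_poly(const struct cube *k, double u, double v)
{
    return (k->a*u*v + k->b*u + k->c*v + k->d) /
           (k->e*u*v + k->f*u + k->g*v + k->h);
}
```

With coefficients 1 through 8, rz_recip at (INFINITY, INFINITY) reduces to a/e = 1/5 and at (INFINITY, 1.0) to (a+b)/(e+f) = 3/11, while rz_poly at (INFINITY, INFINITY) yields NaN. The reciprocal form has its own restriction: it cannot be evaluated at u = 0 or v = 0, which the limit values -1, 1 and POSINF in the listing avoid.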

1.3 Sample Modifications

double x()
{
    static int index = 0;
    double q;

    if (index == 0)
        q = 2.0;
    else if (index == 1)
        q = 1.0;
    else {
        if ((index-2)%3 == 0)
            q = 2.0*(1.0 + (((double) index) - 2.0)/3.0);
        else
            q = 1.0;
    }

    ++index;
    return(q);
}
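
The modified x() above returns the partial quotients 2, 1, 2, 1, 1, 4, 1, 1, 6, ..., which is the regular continued fraction of e. As a quick check of that sequence, the sketch below restates the rule as a pure function of the index and accumulates convergents with the same recurrence and starting values the programs use for p and q. The names e_quotient and e_convergent are hypothetical and serve only this illustration.

```c
/* Partial quotient number `index` (from 0) of e = [2; 1, 2, 1, 1, 4, 1, 1, 6, ...]. */
double e_quotient(int index)
{
    if (index == 0)
        return 2.0;
    if (index >= 2 && (index - 2) % 3 == 0)
        return 2.0 * (1.0 + (index - 2) / 3);   /* 2, 4, 6, ... */
    return 1.0;
}

/* Convergent p/q after ingesting the first n partial quotients (n >= 1). */
void e_convergent(int n, double *p, double *q)
{
    double pm1 = 1.0, qm1 = 0.0;    /* previous convergent */
    double pm2 = 0.0, qm2 = 1.0;    /* convergent before that */
    int i;

    *p = pm1;
    *q = qm1;
    for (i = 0; i < n; ++i) {
        double t = e_quotient(i);

        *p = t * pm1 + pm2;
        *q = t * qm1 + qm2;
        pm2 = pm1;
        qm2 = qm1;
        pm1 = *p;
        qm1 = *q;
    }
}
```

With n = 10 the recurrence gives p/q = 1457/536, roughly 2.7182836, which differs from e by less than 2e-6.
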
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double  rx() 

{
    double exp();

    return(exp(1.0));
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